In this blog I’d like to cover every aspect of parsing Office binary documents, and do it in fewer than a thousand words.  What follows is more realistic, and therefore narrower in focus.  Specifically, I’ll examine the PowerPoint binary format from the point of view of parsing and enumerating “Pictures”.  PowerPoint Pictures are found in the Pictures Stream (as you will see below).  By contrast, Shape objects (rectangles, squares, lines, etc.) do not exist in the Pictures Stream, so enumerating Shape objects may be the subject of a future blog.

 

There are several Office binary file formats defined as part of the Open Specification document set and you can find them detailed here.  The one I’ll focus on is the PowerPoint binary file format specification (MS-PPT).  And, since you can’t find Office pictures in a binary document without referring to the MS-ODRAW specification, many of the details in this blog derive from structure definitions in MS-ODRAW.  This analysis is specific to “Pictures”, meaning an actual picture inserted into the PowerPoint document, so to follow along you may refer to MS-PPT section 2.1.3 Pictures Stream: An optional stream whose name MUST be "Pictures".  The contents of the Pictures Stream are defined by the OfficeArtBStoreDelay record as specified in MS-ODRAW section 2.2.21.

 

Looking at MS-ODRAW section 2.2.21 OfficeArtBStoreDelay: This record specifies the delay loaded container of BLIPs in the host application. There is no OfficeArtRecordHeader for this container.

rgfb (variable): An array of OfficeArtBStoreContainerFileBlock records that specifies BLIP data.  The array continues while the rh.recType field of the OfficeArtBStoreContainerFileBlock record is equal to 0xF007 or between 0xF018 and 0xF117, inclusive.
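The continuation rule for rgfb can be written as a small predicate.  This is only a minimal sketch; the function name isBStoreFileBlockType is my own and does not come from the specification.

```cpp
#include <cstdint>

// Hypothetical helper (the name is illustrative, not from MS-ODRAW).
// Returns true when recType identifies an OfficeArtBStoreContainerFileBlock:
// 0xF007 (OfficeArtFBSE) or 0xF018..0xF117 inclusive (OfficeArtBlip).
constexpr bool isBStoreFileBlockType(std::uint16_t recType)
{
    return recType == 0xF007 || (recType >= 0xF018 && recType <= 0xF117);
}
```

A reader loop would stop at the first header for which this returns false.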

 

For an example, I created a very simple MS-PPT document and inserted (Menu Ribbon -> Insert Tab) a Picture and a ClipArt object.  Based on the above specification details I’ll enumerate the image objects, as shown below (Hex file view after opening the Pictures Stream and delving to the OfficeArtBStoreContainerFileBlock).

 

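You’ll find the color coded details below, which explain this hex data block.  Each record starts with an 8-byte record header (rh), stored little-endian: recVer and recInstance share the first two bytes, recType is the next two, and recLen is the last four.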

A0 46 1D F0 AE 03 00 00 B1 A6 19 08 D8 C3 0B 6F //image data follows

B5 6C A3 98 8C 9E F4 65 FF FF D8 FF E0 00 10 4A

…//skipping forward to the next record

60 21 1B F0 38 6D 00 00 27 CF 5A 3B 2E DC 9E 2E //image data follows

16 A1 A1 59 52 1B 76 E9 AC A7 00 00 DC F7 FF FF
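To make the first 8 bytes of each record easier to read, here is a minimal sketch of decoding them.  It assumes the OfficeArtRecordHeader layout from MS-ODRAW (recVer in the low 4 bits and recInstance in the high 12 bits of the first 16-bit word, then a 16-bit recType and a 32-bit recLen, all little-endian).  The names RecordHeader and decodeHeader are my own.

```cpp
#include <cstdint>

// Illustrative struct; field names follow the spec, the type name does not.
struct RecordHeader {
    std::uint8_t  recVer;       // low 4 bits of the first 16-bit word
    std::uint16_t recInstance;  // high 12 bits of the first 16-bit word
    std::uint16_t recType;      // second 16-bit word
    std::uint32_t recLen;       // number of bytes following the header
};

// Decodes a header from the 8 little-endian bytes starting at p.
// The caller must guarantee that at least 8 bytes are readable.
inline RecordHeader decodeHeader(const std::uint8_t* p)
{
    const std::uint16_t verInst = static_cast<std::uint16_t>(p[0] | (p[1] << 8));
    RecordHeader h;
    h.recVer      = static_cast<std::uint8_t>(verInst & 0x000F);
    h.recInstance = static_cast<std::uint16_t>(verInst >> 4);
    h.recType     = static_cast<std::uint16_t>(p[2] | (p[3] << 8));
    h.recLen      = static_cast<std::uint32_t>(p[4])
                  | (static_cast<std::uint32_t>(p[5]) << 8)
                  | (static_cast<std::uint32_t>(p[6]) << 16)
                  | (static_cast<std::uint32_t>(p[7]) << 24);
    return h;
}
```

For the first row of the dump (A0 46 1D F0 AE 03 00 00) this gives recVer 0x0, recInstance 0x46A, recType 0xF01D and recLen 0x3AE.  For the second record (60 21 1B F0 38 6D 00 00) it gives recInstance 0x216, recType 0xF01B and recLen 0x6D38.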

 

The first type of image object shown above is the JPEG I inserted, as shown in MS-ODRAW section 2.2.22 OfficeArtBStoreContainerFileBlock:

Value            Meaning

0xF007           OfficeArtFBSE record

0xF018 – 0xF117  OfficeArtBlip record

 

I am only dealing with OfficeArtBlip records in my example, so MS-ODRAW section 2.2.23 OfficeArtBlip (note color coded hex dump above):

Value  Meaning

0xF01A OfficeArtBlipEMF  record

0xF01B OfficeArtBlipWMF  record

0xF01C OfficeArtBlipPICT record

0xF01D OfficeArtBlipJPEG record

0xF01E OfficeArtBlipPNG  record

0xF01F OfficeArtBlipDIB  record

0xF029 OfficeArtBlipTIFF record

0xF02A OfficeArtBlipJPEG record
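If you want to branch on the record type in code, the table above maps directly onto a switch.  This is a sketch only; the BlipKind enum and the classifyBlip function are names I made up for illustration.

```cpp
#include <cstdint>

// Illustrative enum, one value per image format in the table.
enum class BlipKind { Emf, Wmf, Pict, Jpeg, Png, Dib, Tiff, Unknown };

// Maps rh.recType to the kind of OfficeArtBlip record it identifies.
// Values not listed in the table are reported as Unknown.
inline BlipKind classifyBlip(std::uint16_t recType)
{
    switch (recType) {
        case 0xF01A: return BlipKind::Emf;
        case 0xF01B: return BlipKind::Wmf;
        case 0xF01C: return BlipKind::Pict;
        case 0xF01D: return BlipKind::Jpeg;
        case 0xF01E: return BlipKind::Png;
        case 0xF01F: return BlipKind::Dib;
        case 0xF029: return BlipKind::Tiff;
        case 0xF02A: return BlipKind::Jpeg;  // second JPEG record type
        default:     return BlipKind::Unknown;
    }
}
```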

 

The first record is the OfficeArtBlipJPEG, MS-ODRAW section 2.2.27 OfficeArtBlipJPEG:

rh.recVer      MUST be 0x0.

rh.recInstance Specified in the following table.

rh.recType     MUST be 0xF01D.

rh.recLen      An unsigned integer that specifies the number of bytes following the header.

MUST be the size of BLIPFileData plus 17 bytes if recInstance is 0x46A or 0x6E2 or the size of BLIPFileData plus 33 bytes if recInstance is 0x46B or 0x6E3.

 

Value    Meaning                     Number of unique identifiers

0x46A    JPEG in RGB color space     1

0x46B    JPEG in RGB color space     2

0x6E2    JPEG in CMYK color space    1

0x6E3    JPEG in CMYK color space    2
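The recLen rule tells you how many bytes sit between the record header and the actual JPEG file data: 17 or 33, which I read as one or two 16-byte unique identifiers plus a one-byte tag.  A minimal sketch of that arithmetic follows; the function names are my own.

```cpp
#include <cstdint>

// Bytes between the end of the record header and the start of BLIPFileData
// in an OfficeArtBlipJPEG record: 17 when recInstance is 0x46A or 0x6E2,
// 33 when it is 0x46B or 0x6E3. Returns 0 for any other recInstance.
constexpr std::uint32_t jpegPrefixSize(std::uint16_t recInstance)
{
    return (recInstance == 0x46A || recInstance == 0x6E2) ? 17u
         : (recInstance == 0x46B || recInstance == 0x6E3) ? 33u
         : 0u;
}

// Size of the embedded JPEG data, or 0 when recInstance is not one of the
// four listed values or recLen is too small to hold the prefix.
constexpr std::uint32_t jpegDataSize(std::uint16_t recInstance, std::uint32_t recLen)
{
    return (jpegPrefixSize(recInstance) != 0u && recLen >= jpegPrefixSize(recInstance))
         ? recLen - jpegPrefixSize(recInstance)
         : 0u;
}
```

For the example record (recInstance 0x46A, recLen 942) this leaves 925 bytes of JPEG data, which is consistent with the FF D8 JPEG marker appearing 17 bytes after the header in the hex dump.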

 

Since recLen is 0x3AE = 942 bytes, you would read the 8-byte record header, note the type of the record (0xF01D), skip forward 942 bytes to the next 8-byte record header, and repeat while there are records left to read in the Pictures Stream.  In this example I have only two records; the next one reads as 0xF01B OfficeArtBlipWMF, with a length of 0x6D38 = 27,960 bytes.  Together with the two 8-byte headers, that accounts for 8 + 942 + 8 + 27,960 = 28,918 bytes, which matches the stream size reported in cbSize further down.
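The read-header-then-skip loop can be sketched as follows, assuming the whole Pictures Stream has already been read into a byte vector.  RecordInfo, readU16, readU32 and enumerateRecords are illustrative names of mine, and the bounds checks are my own additions rather than anything the specification prescribes.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative summary of one record found in the stream.
struct RecordInfo {
    std::uint16_t recInstance;
    std::uint16_t recType;
    std::uint32_t recLen;
    std::size_t   dataOffset;  // offset of the first byte after the header
};

static std::uint16_t readU16(const std::uint8_t* p)
{
    return static_cast<std::uint16_t>(p[0] | (p[1] << 8));
}

static std::uint32_t readU32(const std::uint8_t* p)
{
    return static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}

// Walks the stream header by header, skipping recLen bytes each time.
// Stops at the end of the data, at a recType outside the allowed set,
// or at a record whose recLen runs past the end of the buffer.
std::vector<RecordInfo> enumerateRecords(const std::vector<std::uint8_t>& stream)
{
    std::vector<RecordInfo> records;
    std::size_t pos = 0;
    while (stream.size() - pos >= 8) {
        const std::uint8_t* p = stream.data() + pos;
        RecordInfo r;
        r.recInstance = static_cast<std::uint16_t>(readU16(p) >> 4);
        r.recType     = readU16(p + 2);
        r.recLen      = readU32(p + 4);
        r.dataOffset  = pos + 8;

        const bool isFileBlock =
            r.recType == 0xF007 || (r.recType >= 0xF018 && r.recType <= 0xF117);
        if (!isFileBlock)
            break;
        if (r.recLen > stream.size() - r.dataOffset)
            break;  // truncated record

        records.push_back(r);
        pos = r.dataOffset + r.recLen;
    }
    return records;
}
```

Run against the sample file from this post, I would expect this to return two entries, one with recType 0xF01D and one with recType 0xF01B.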

 

The following sample code snippet demonstrates how to delve into the stream structures with IStorage.  You may use this as a simple starting point in your investigation of the Office binary file structures.  (Note, of course, that you will want to add error handling and follow other coding best practices.  This is only a small snippet I used in this example to investigate the stream structures with IStorage.)

 

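#include <windows.h>
#include <objbase.h>   // StgOpenStorage, IStorage, IStream (link with ole32.lib)
#include <wchar.h>     // wcscmp

int main(int argc, char* argv[])
{
      HRESULT hr;
      IStorage *pStg = NULL;

      hr = CoInitialize(NULL);
      if (FAILED(hr))
            return 1;

      // Open the compound file read-only.  Replace the placeholder path with your own.
      hr = StgOpenStorage(
            L"<path to file>\\powerpoint-file.ppt",
            NULL, STGM_READ | STGM_SHARE_EXCLUSIVE, NULL, 0, &pStg);

      if (SUCCEEDED(hr))
      {
            IEnumSTATSTG *pEnumStat = NULL;
            hr = pStg->EnumElements(0, NULL, 0, &pEnumStat);

            if (SUCCEEDED(hr))
            {
                  DWORD dwFetched;
                  STATSTG stat;

                  // Walk the top-level elements looking for the "Pictures" stream.
                  while (pEnumStat->Next(1, &stat, &dwFetched) == S_OK)
                  {
                        if (!wcscmp(L"Pictures", stat.pwcsName))
                        {
                              IStream *pStm = NULL;
                              hr = pStg->OpenStream(
                                    stat.pwcsName,
                                    NULL, STGM_READ | STGM_SHARE_EXCLUSIVE, 0, &pStm);

                              if (SUCCEEDED(hr))
                              {
                                    // add processing here

                                    pStm->Release();
                              }
                        }

                        // The caller owns the name string returned by Next.
                        CoTaskMemFree(stat.pwcsName);
                  }
                  pEnumStat->Release();
            }
            pStg->Release();
      }
      CoUninitialize();
      return 0;
}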

 

Note the details of the stat structure in the code snippet above:

Stat              {pwcsName=0x006d3708 "Pictures" type=2 cbSize={...} ...} tagSTATSTG

pwcsName          0x006d3708 "Pictures"               wchar_t *

type              2                                   unsigned long

cbSize            {28918}                             _ULARGE_INTEGER

mtime             {dwLowDateTime=0 dwHighDateTime=0 } _FILETIME

ctime             {dwLowDateTime=0 dwHighDateTime=0 } _FILETIME

atime             {dwLowDateTime=0 dwHighDateTime=0 } _FILETIME

grfMode           0                                   unsigned long

grfLocksSupported 0                                   unsigned long

clsid             {GUID_NULL}                         _GUID

grfStateBits      0                                   unsigned long

reserved          0                                   unsigned long

 

To expand on the preceding code, since you have the length of each picture and the image type, you could read the bytes into a byte array and write them out to disk, or whatever your requirements may entail.

 

I hope this helps shed some light on parsing Office binary files in general and how you might approach parsing PowerPoint binary files for pictures.