Object-embedding in OOXML with Microsoft Office 2007

by jlundstocholm 12. April 2008 17:10

(updated 2008-04-14, added links to external resources) 

Now that the ISO-vote and approval of OOXML are done with, it is time to continue the coverage of implementing OOXML as well as ODF – this time about OOXML, Microsoft Office 2007 and embedded objects.

As I have previously said, there are always quirks when it comes to implementations of any standard in large applications. I have covered a few of these already regarding mathematical content [0], [1], and it is no different with regard to object embedding. I should say that a source of inspiration for this article was Stéphane Rodrigues’ article about binary Parts of an OOXML-file (OPC-package).

Now, embedding objects in an OOXML-file is pretty straight-forward: Simply add the object somewhere in the package and make a reference to the location and specify what kind of file you are embedding. This is very similar to how it is done in ODF.

(note: the specific schema-fragments defining how to do this were dealt with and changed at the BRM, so I will not include these until the final version of IS 29500 is released. I will update this article according to the revised spec).

As I have noted earlier, interoperability happens at application-level, so it is worth pondering a bit on how the specification is implemented in the major implementations of it. So let’s see how Microsoft Office acts when embedding objects.

What I did was this: 

I used Microsoft Office 2007, created a text-document and embedded an object in it – in this case an OpenOffice.org Calc Spreadsheet. The spreadsheet is also inspired by one of Stéphane Rodrigues’ articles, the infamous “OOXML is defective by design”.

 

The object is inserted and displayed in the document. When activating the object, I can edit it as if it was in OOo Calc itself. Actually it is OOo Calc itself. It is invoked using OLE and as a side-note it shows a cool thing about OLE – or similar other object linking techniques. Microsoft Office 2007 does not know anything about OpenOffice.org, yet it is still able to invoke the application and edit the embedded object.

 

Ok – now let’s look at the OOXML-file created. In the file document.xml the following fragment is located:


The <v:shape>-element is part of the nasty VML-dependency that luckily was dealt with at the BRM. This will be replaced by DrawingML in the final IS 29500. The <o:OLEObject>-element specifies the type of the embedded object (“opendocument.CalcDocument.1”) and the location of it (“rId5”). There is really nothing platform dependent here in the OOXML-markup.

What is more interesting, though, is looking at the Calc-object after it is embedded. By navigating through the relationship-model of the OPC-package, the embedded object is located.
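
To make that navigation concrete, here is a minimal Python sketch of how a relationship id such as “rId5” could be resolved to a part inside the package. It assumes that the relationships of word/document.xml live in word/_rels/document.xml.rels and use the OPC relationships namespace. The function name resolve_relationship is hypothetical and chosen only for this illustration.

```python
import posixpath
import xml.etree.ElementTree as ET
import zipfile

# Namespace used by OPC relationship parts (*.rels).
RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def resolve_relationship(package, rel_id, source_part="word/document.xml"):
    """Return the package path that relationship rel_id of source_part targets.

    package may be a file name or a binary file-like object.
    """
    folder, name = posixpath.split(source_part)
    rels_part = posixpath.join(folder, "_rels", name + ".rels")
    with zipfile.ZipFile(package) as zf:
        root = ET.fromstring(zf.read(rels_part))
    for rel in root.findall(f"{{{RELS_NS}}}Relationship"):
        if rel.get("Id") != rel_id:
            continue
        target = rel.get("Target")
        if rel.get("TargetMode") == "External":
            return target  # points outside the package, nothing to resolve
        # Internal targets are relative to the folder of the source part.
        return posixpath.normpath(posixpath.join(folder, target)).lstrip("/")
    raise KeyError(f"no relationship {rel_id!r} in {rels_part}")
```

Called with the path of a saved .docx and "rId5", this would return the name of the embedded part. The exact part name depends on the producing application.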

 

One might think that this file was simply the Calc-file renamed, but sadly this is not so. This file is actually the Calc-file wrapped in an OLE2 Compound file (“CF”). The CF-file is basically a stream wrapper which allows a number of streams to be persisted in a file as well as information about these streams. Using one of the many CF-viewers you can get the data of the wrapped file itself as well as the persisted information of it, here “com.sun.star.comp.Calc.SpreadsheetDocument _   Embedded Object _   opendocument.CalcDocument.1”.
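
A quick way to tell the two cases apart without a CF-viewer is to look at the first bytes of the extracted part. The sketch below is only an illustration under stated assumptions: the function name classify_embedded_part is hypothetical, and it checks nothing beyond the 8-byte signature that compound files start with and the local-file-header signature that a ZIP-based file (such as a plain Calc-file) would normally start with.

```python
# Signature found at the start of an OLE2 Compound File.
CF_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")
# Local file header signature at the start of a typical ZIP archive.
ZIP_SIGNATURE = b"PK\x03\x04"


def classify_embedded_part(data: bytes) -> str:
    """Guess the container type of an embedded part from its leading bytes."""
    if data.startswith(CF_SIGNATURE):
        return "compound-file"
    if data.startswith(ZIP_SIGNATURE):
        return "zip"
    return "unknown"
```

If the Calc-file had simply been renamed, a check like this would report "zip" for the embedded part rather than "compound-file".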

 

 

Technically this is really not a big deal – there are well-known ways to manipulate these files on all platforms and most programming languages and extracting the required data should really be a no-brainer. OpenOffice.org is licensed under LGPL, so you can use the source-code from this to figure out how to do it on the platforms supported by OpenOffice.org. It is also pretty evident why Microsoft Office 2007 works this way. Microsoft Office 2007 is the latest incarnation of the Microsoft Office Suite – a suite that has depended on this file format since at least 1999 … and of course on OLE itself as well. So if you want to implement a document consumer, this is simply something to be aware of when consuming OOXML-files.

From the perspective of a developer, however, this is really annoying. I would definitely opt for Microsoft Office 2007 embedding the objects simply as the objects they are – and not wrapping them in a CF-wrapper. This is how it is done in OpenOffice.org. Granted, this suite does other weird things like renaming the files and not being entirely clear about how to embed all object types, but the objects are embedded as they are (unless they are OpenDocument objects). This is a benefit to me as a developer when examining OOXML-files, because I can simply extract the object in question from the document package and verify the file.

So this might be the first new post-vote modification to IS 29500:

 

When embedding objects an application shall not modify or wrap the embedded object in any way before embedding it in the package. When a document consumer encounters an embedded object, this shall not be converted to another object type without knowledge-based confirmation by the user.

 

This (or similar wording in standard-lingo) would prevent Microsoft Office from wrapping objects in CF-wrappers, but it would also prevent applications like OpenOffice.org on SUSE from converting embedded Excel-objects to Calc-spreadsheets. FYI, this kills interop too.

A final request: Microsoft, please, as you must already be implementing the changes from the BRM for Office 2007, would you be so kind as to make this change to the application as well? It should really be a no-brainer, and if there should be any requirements in your code for the CF-files, feel free to load the objects, wrap them in an in-memory CF-file and take it from there.

:-)

Comments

4/12/2008 9:03:27 PM #

hAl

Good piece.
The whole compound file format embedding is useless now that OOXML in itself already is a compound file which can contain multiple embedded files in their own directory structure.

I am not sure if it would be a very relevant thing to change to the standard (as potentially OOXML and ODFs zipfiles in itself can be considered a wrapper) but it would be relevant for MS Office to just dump that compound file method.

hAl |

4/13/2008 2:12:34 AM #

orcmid

I'm curious about the other reasons for the packaging that OLE does for the persistent form of the embedded material.  It strikes me that some sort of container is needed (or else multiple embeddings and relationships) because MS Office also carries a rendering of the object for display in the event that the linked application is not present for doing the rendering.  

I am drawing on a very old exploration of OLE, but I thought that was one reason that the embedding material is wrapped up in a container.  I think the other reason is that Office doesn't actually know what the original file is.  It uses OLE to ask the "host" application (viewed as a server in this case) for a persistent rendering of a linking or an embedding so that it can be re-incarnated later.  The embedding might be of only a fragment of the original document and not be in a form that is a document of the kind that the host puts in files.  A persistent presentation is retained as well, in case the MS Office document is opened on a computer where the correct host application (version) is not available.

I suppose one can work on this via alternate renditions, but it is important to understand that Microsoft Office, when doing an OLE linking or embedding, doesn't know anything about the file this is all coming from, and if it did, it would be very quirky to include a gigunda Excel file from which a single chart is being linked/embedded.

[I love instant design too.]

OK, having raised all of that, I think this is a great analysis and the kinds of discussions that need to be held to find out ways to harmonize these document formats. And however this is done, there needs to be a way to figure out what part of the linking/embedding is actually being used in the inclusion and how it is to appear and who can edit it in that context.

orcmid United States |

4/14/2008 6:18:23 PM #

jlundstocholm

Dennis,

I am glad you like the article :-)

About the rendering of the object, I cannot say if CF-files (and OLE-files in particular) have a "representation" of the object embedded, but I would seriously doubt it. The representation is more likely to be embedded in the document format than in the container itself. Remember that the CF-file (or OLE2 Compound File) is merely a wrapper around an OLE-object - which itself is nothing more than an object implementing a particular interface. The OOXML-format itself carries a "thumbnail" of the object so there is no need for the OLE-wrapper - at least not as far as I can see.

About the type of the object, the CF carries this with it in one of the parts (streams) but this information is also in OOXML - the ProgId-attribute is intended to contain just this. As you can see with the DocFile-view of the CF-file, it matches the contents of this attribute. Also in OOXML there is a ContentType-specification of the part in question, and this could/should be used to further specify the object embedded. Sadly Office 2007 uses a generic ContentType for this, so no real information is available here.

To any MS-guys reading this: What is the process of specifying which ContentType to use with which kind of object in OOXML? Is it done by "market-consensus" or is there a more formal way of doing this?

jlundstocholm Denmark |

4/15/2008 2:22:06 AM #

orcmid

I just blew it.  Wrote a comment on this and then closed the browser window by mistake. Yuk.

Here's what puzzles me.  My thinking is that a big part of the persistent binary material that Office stores for an embedding is actually delivered over an OLE interface from the host of the embedding.  That is, I think OpenOffice.org Calc governs what is delivered, and it has to be different than the original raw file so that the embedding can be reconstituted properly if the user chooses to edit it or refresh it (in the case of a linking).  

In the past, there has also been a WMF-format capture of the rendition of the embedding (not a thumbnail but the actual view), and that will be found somewhere in the way Office wraps things up and stashes it inside the Office file.  

I suppose this is another reason why the Compound Binary File format and WMF format specifications were both published under the Open Specification Promise as part of the recent release of the Office binary formats for Word, Excel, and PowerPoint.

In your investigations, have you confirmed exactly how much of this is opaque to Office and determined on the host (OpenOffice) side of the OLE connection and how much is further wrapping by Office to provide a bundle that it handles the same for any OLE embedding?

orcmid United States |

4/15/2008 5:47:14 AM #

Dan

orcmid is correct. Office 2007 does not modify the embedded object at all. The only method of cross-app communication for OLE embedding is an IStorage. Office 2007 persists this IStorage into the OOXML file AS-IS.

Dan United States |

4/18/2008 10:29:21 PM #

jlundstocholm

Dan,

I didn't mean to imply that Office 2007 changed the embedded object (stream) when wrapping the stream in a CF-wrapper. As you say, the stream is wrapped as is in a CF-wrapper, which basically is persisting an object implementing the IStorage-interface. I know that OLE requires implementation of the IStorage interface to be used with OLE itself, but OLE is not required for cross app interop - as you know it works just fine in ODF where OLE is not required on the hosting platform either.

Anyway, I'm a bit low on deep knowledge of this so please let me know of any mistakes and feel free to contribute any new information.

:-)

jlundstocholm United Kingdom |

4/19/2008 2:42:51 AM #

orcmid

I see.  You are arguing for a different kind of embedding in OOXML, rather than the OLE case.  Considering the OLE case will not go away any time soon, at least in MS Office, I wonder how effective this will work out to be.

I am curious about the reverse experiment (on Windows, at least), using an OLE embedding with OO.o.  There are a variety of provisions for OLE in the ODF specification (under draw: and presentation: and animation, with mention under packaging too).  

I'm also curious whether the full-file embedding that you observed was in an implementation (i.e., OO.o) and how is it tied to ODF.  (I suppose this question is about how is the embedding referenced from the main XML document.)  

I can see that this is a harmonization concern, for sure.  My attention is drawn to this intriguing statement:

"Application that support objects should support linking to objects that are contained within the same package. They may also support linking to object located outside the package."

(PDF sheet 308 of ISO 26300:2006 under Object Data in section 9.3.3 Objects.)

ODF also supports Java Applets (with about as little detail), which is further intriguing.

orcmid United States |

4/20/2008 3:56:13 AM #

Jesper Lund Stocholm

Dennis,

Let me see if I can clarify what my reasoning is:

If you decide to embed some object in some document format, you basically only need to provide two things:

1: A location of the object in the "document"
In OOXML and ODF this means the "path" to the sub-stream in the ZIP-archive

2: An indication of which type of object you are embedding
In OOXML (and more or less in ODF as well) this is done using a content type and (in OOXML) a "ProgId".

What a consuming application does with the object, how it communicates with it, how it manipulates the content of the object, how the object is presented to the user etc - these are all application specific behaviours that will differ from application to application.

What Microsoft Office 2007 does is that it wraps the object being embedded in a CF-wrapper to make it more "pleasing" to Microsoft Office 2007 (and other OLE-enabled OS/application).

The reason why I think this is a bad choice is that it does not exactly make interop any easier. I realise that a consuming application simply needs to remove the CF-wrapper from around the wrapped stream (which should technically not pose any problems), but I still don't like it. There are worse cases of "interop-barriers" out there, though. I once tested Lotus Notes 8 on SLED where we looked at interop using ODF-files with embedded Excel-spreadsheets (see idippedut.dk/post/2007/11/Whats-up-with-OLE.aspx) and LN8 actually (upon prompting the user with a question no ordinary user will ever be able to answer) converts the embedded Excel-spreadsheet to an ODF-spreadsheet and persists this in the ODF-package instead. By this they basically kill the possibility of round-tripping the document, but hey - at least they made the document "all ODF".

:-)

Jesper Lund Stocholm Denmark |

4/20/2008 12:26:17 PM #

orcmid

I think there is a confusion of other kinds of embedding with OLE embedding here.

I am not going to dispute whether there are other ways to do embedding.  Instead I just want to have clarity on what OLE embedding means in both OOXML (where it is supported) and in ODF, where it is supported.  

I did the following things:

1. I opened a spreadsheet that I use in Excel 2003.  I use the compatibility pack to load and save it as an .xls, but as far as Excel 2003 OLE functions go, it doesn't know about the .xls and only deals with its internal format (which is never on the disk).

2. I opened OpenOffice.org Writer 2.3 for Windows (Sun Microsystems distribution) and started a document describing what I was doing.

3. I selected a section of the Excel spreadsheet representing last week's daily fitness records for me.  I copied the selection to the clipboard.  I used Paste Special to paste the clipping as a Microsoft Excel Worksheet.  It landed in the Writer document just fine and presents perfectly.  

4. If I selected the Edit or the Open option on the included image, Writer uses OLE to launch an Excel instance that has the full spreadsheet, with my selection still high-lighted.

5. If I look at the .odt file, here is the ODF for the insertion (I hope the XML renders properly):

<text:p text:style-name="Standard">
   <draw:frame draw:style-name="fr1" draw:name="Object1" text:anchor-type="paragraph"
               svg:width="5.948in" svg:height="1.4898in" draw:z-index="0">
      <draw:object-ole xlink:href="./Object 1" xlink:type="simple" xlink:show="embed"
                       xlink:actuate="onLoad" />
      <draw:image xlink:href="./ObjectReplacements/Object 1" xlink:type="simple"
                  xlink:show="embed" xlink:actuate="onLoad" />
   </draw:frame>
</text:p>

6. Both of the "Object 1" files are binary.  The draw:image one appears to be a WMF or EMF file and it renders as the image of the selection.

7. The other "Object 1" is also raw binary.  You will be pleased to know this is how it begins:

D0 CF 11 E0 A1 B1 1A E1-00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00-3E 00 03 00 FE FF 09 00
06 00 00 00 00 00 00 00-00 00 00 00 03 00 00 00
00 00 00 00 01 00 00 00-00 10 00 00 5B 00 00 00
01 00 00 00 FE FF FF FF-00 00 00 00 03 00 00 00
5E 00 00 00 5F 00 00 00-FF FF FF FF FF FF FF FF

8. Now, there are a lot of ways that the ODF has too little information and only the implementation (OO.o in this case) knows what the two "Object 1" files are and what they are intended to be.

9. NEVERTHELESS, I was able to save the OO.o file as a Word 95/2000/XP .doc file and it opened perfectly and the embedding worked perfectly. (In this case, Edit will open the edit window in place in the document, whereas Open opens a completely separate Excel instance that looks and operates identically to the one from OO.o; naturally, they were both fired up using OLE.)

10. I even saved the .doc version as a .docx and looked at the OOXML file.  In this case there is a lot more information, both in the document.xml markup about the OLE object and in the way the binary bits are stored (image1.emf in the one case, oleObject.bin in the other).

11. So roundtripping works, and the ODF that OO.o produces is completely valid (according to the ODF specification) even though it relies too much on application-specific tacit behavior rather than providing as much description of what things are as OOXML does.

12. NOT CROSS-PLATFORM.  Now OLE works because you have the other application on the machine you open the containing document on.  If the application is not there, you just get the alternate rendition and cannot open or edit.   That strikes me as not too bad.  You are free to disagree.  I just want to point out that when OO.o does ODF-legitimate OLE embedding, it is no better (and actually less helpful) than the OOXML case as it is used by Word.
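
For readers who want to decode the bytes quoted in point 7, here is a small Python sketch that reads the start of an OLE2 Compound File header. It is an illustration only: the function name parse_cf_header is hypothetical, the field layout follows the published Compound File Binary format as I understand it, and only the first 32 bytes are interpreted.

```python
import struct

CF_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")


def parse_cf_header(data: bytes) -> dict:
    """Decode the first 32 bytes of an OLE2 Compound File header."""
    signature, clsid, minor, major, byte_order, sector_shift = struct.unpack_from(
        "<8s16sHHHH", data, 0
    )
    if signature != CF_SIGNATURE:
        raise ValueError("not an OLE2 Compound File")
    return {
        "minor_version": minor,
        "major_version": major,
        "byte_order": byte_order,          # 0xFFFE marks little-endian data
        "sector_size": 1 << sector_shift,  # stored as a power of two
    }


# The first two rows of the dump quoted in point 7.
dump = bytes.fromhex(
    "D0 CF 11 E0 A1 B1 1A E1 00 00 00 00 00 00 00 00"
    "00 00 00 00 00 00 00 00 3E 00 03 00 FE FF 09 00"
)
header = parse_cf_header(dump)
```

For the quoted bytes this gives major version 3 and a sector size of 512 bytes, which is consistent with the "Object 1" part being a compound file rather than a raw spreadsheet.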

orcmid United States |

4/20/2008 3:56:57 PM #

orcmid

I forgot something when exploring the .odt file that had the Excel embedding via OLE.  The manifest.xml file has important information.  In particular,

<manifest:manifest xmlns:manifest="urn:oasis:names:tc:opendocument:xmlns:manifest:1.0">
      <!-- Only the two Object 1 objects are shown -->
   <manifest:file-entry
         manifest:media-type="application/x-openoffice-wmf;windows_formatname=&quot;Image WMF&quot;"
         manifest:full-path="ObjectReplacements/Object 1" />
   <manifest:file-entry
         manifest:media-type="application/vnd.sun.star.oleobject"
         manifest:full-path="Object 1" />
</manifest:manifest>

Interesting, huh?
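
Looking these entries up can be scripted. A minimal Python sketch, assuming the manifest XML has already been read from the package (ODF places it at META-INF/manifest.xml). The function name manifest_media_types is made up for this example.

```python
import xml.etree.ElementTree as ET

# Namespace declared on the manifest root element.
MANIFEST_NS = "urn:oasis:names:tc:opendocument:xmlns:manifest:1.0"


def manifest_media_types(manifest_xml):
    """Map each full-path listed in an ODF manifest to its media type."""
    def q(name):
        return f"{{{MANIFEST_NS}}}{name}"

    root = ET.fromstring(manifest_xml)
    return {
        entry.get(q("full-path")): entry.get(q("media-type"))
        for entry in root.iter(q("file-entry"))
    }
```

For the two entries shown, the resulting dictionary would map "Object 1" to the application/vnd.sun.star.oleobject media type, which is all a consumer learns about the object from the manifest.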

Now, one can argue for a different kind of embedding that might work better, but it will have to retrace all of the requirements and integration flexibility that OLE supports.  The other prospect is to figure out how to harmonize OLE.  Technically, as long as there is some sort of OLE middleware on any platform, there could be harmonization so long as the OLE servers deliver payloads to the clients that can be recognized by their peers on other platforms.  

The ability of the client to not know how many sources of OLE objects there are, having it be more like a big callback process to deliver the persisted object data back to the appropriate object instantiation mechanism is important.  It will be interesting to see what a harmonization can achieve with regard to identification of the object class and finding a platform-neutral mechanism for instantiating and then dealing with the OLE (or whatever else it is called) objects.

I concede that having a common way of knowing the class identification is important.  For the OpenOffice.org version, that knowledge would appear to be buried within the application/vnd.sun.star.oleobject media type, whatever it is.

A fitting challenge for harmonization, aye?

orcmid United States |

4/20/2008 4:11:45 PM #

orcmid

In point (1) of my first April 20 note, the spreadsheet that I used is in .xlsx format, and is saved in .xlsx format.  However, for OLE purposes, Excel 2003 knows nothing of the .xlsx format and has no persistent file that it can share that it understands -- the Excel 2003 version only lives in memory of the application, in essence.  That is a minor point, but it is why I had to use the clipboard because OO.o gets confused about the file-type if I give it the .xlsx to embed (not sure whether the problem is in OO.o not finding what it needs about the extension in the registry or whether the problem is from elsewhere).  However, using the clipboard to "paste" an embedding is an important use-case and it demonstrated what I wanted to see in OO.o's .odt file.

orcmid United States |

4/21/2008 4:39:35 PM #

jlundstocholm

Dennis,

Thanks for the elaborate details you provided.

I am not so much arguing against OLE or trying to diminish the benefits of the OLE-technology.

What I am saying is that when Microsoft Office 2007 wraps an embedded object in an OLE-container, it is an action that makes the particular document in question easier to process for a platform and an application supporting OLE. It might not hurt interop, but it certainly does not make it any easier when transferring documents cross-platform. Embedding of objects should be platform-agnostic, and I simply think the actions of Microsoft Office 2007 stretch this a bit too far.

:-)

jlundstocholm Denmark |

4/22/2008 1:44:47 AM #

orcmid

It is not clear that Word is doing the wrapping.  It may be that the wrapping is done by the OLE host (i.e., OO.o Calc) and all that Word is doing is storing it.  

What seems clear to me is that if Word is doing wrapping, so is OO.o Writer (!) yet Writer is able to save a file with an OLE embedding as Word 95/2000/XP and the OLE embedding is recognizable to Word 2003, which is also able to save it successfully to OOXML.

I think we need to look closer at what is going on.  It may be that more information would be better, but the OLE embedding is unlikely to be the plain file that the OLE host was working on.  There is other information as well, and the OLE client (the container app) has no idea what the host wants to deliver as the embedding data.

I think the key question is how does a container app (using OOXML or ODF format) know how to connect to the correct OLE host.  It is clear that for OO.o 2.3's use of ODF, it must be stashed in the "Object 1" binary data.  It looks like Word 2003 and its OOXML carry the identifier explicitly in the XML element for the embedding (but you have to know how the Windows registry is used to resolve the ID).  Something to clarify this area would be great, perhaps as supplemental information and practice and then maybe something that can be introduced in maintenance to both specifications.

One additional note.  Because the ODF manifest has the useful information about what those "Object 1" files are, I checked the OOXML [Content_Types].xml for the .docx that Word 2003 saved.  It is interesting to compare with the OO.o counterpart:

<Types xmlns="schemas.openxmlformats.org/.../content-types">
  <Default Extension="bin"
           ContentType="application/vnd.openxmlformats-officedocument.oleObject" />
  <!-- Other Defaults and over-rides omitted -->
</Types>

It would appear that this is under the OOXML "namespace, MIME-type" administration and should be documented somewhere in OOXML, yes?  I will go hunting after I get a chance to see DIS 29500 final.  If it is not documented, it is certainly something to make specific as part of maintenance.
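
As a rough illustration of how a consumer might use this part, here is a Python sketch that looks up the content type for a given part name. It assumes the usual OPC rule that an Override for the exact part name wins over a Default for the extension. The function name content_type_of is hypothetical, and the lookup deliberately matches on local element names only, so it makes no assumption about the namespace URI.

```python
import posixpath
import xml.etree.ElementTree as ET


def content_type_of(content_types_xml, part_name):
    """Look up the content type of part_name in a content-types document."""
    root = ET.fromstring(content_types_xml)
    extension = posixpath.splitext(part_name)[1].lstrip(".").lower()
    by_extension = None
    for el in root:
        kind = el.tag.rsplit("}", 1)[-1]  # local name, ignoring the namespace
        if kind == "Override" and el.get("PartName", "").lower() == part_name.lower():
            return el.get("ContentType")  # an Override beats any Default
        if kind == "Default" and el.get("Extension", "").lower() == extension:
            by_extension = el.get("ContentType")
    return by_extension
```

With the fragment above, any part ending in .bin resolves to the generic oleObject content type, which says nothing about which application produced the embedded object.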

Thanks for letting me play here.  I found this to be a useful investigation and I don't think we are that far apart on the matter.  It is important, as you know, to draw out all of the considerations and verify what the actual opportunities are without disrupting something that seems to be working, even if we are not sure why and how.

orcmid United States |

4/22/2008 6:27:51 PM #

nksingh

@orcmid:

What makes you think OO.o doesn't instantiate the OLE app just like Word does, through a GUID reference into the HKLM\Software\Classes\CLSID registry key?

nksingh United States |

4/23/2008 2:10:29 AM #

orcmid

@nksingh: I am not sure who you are asking but I presume that OO.o does pretty much exactly what Microsoft Office software does.  (Actually, I don't think it uses the guid directly, but finds the guid in the registry from a class name string.)

I think the concern here, now that we agree both do OLE embedding the same way (most of the time, there seem to be some special cases and odd edge-case behaviors in OO.o), is how someone inspecting the file could figure out what application and OLE "server" an embedding depends on.  The information appears to be wrapped up in the blob by OO.o.  Although OOXML makes more information available in the OOXML for the embedding, it may need to stash material in the blob too.  (In the example I did, the binary streams for both the WMF/EMF and the embedding were smaller than the ones created by OO.o, and I have no idea why or how.)

So there is something to have more sharply defined in both specifications (at least in the case of OOXML, which uses an OOXML mime-type for the embedding -- OO.o uses a Sun Microsystems mime-type).

A place where more specificity is required to accomplish harmonization.

orcmid United States |

2/10/2009 11:05:07 PM #

Nasir Khan

Hello,

I would like to add external file(like pdf) to the docx package with following conditions:

•  I want to add it when my document(activedocument) is open.
•  I don’t want it to be visible inside document content(like media). I can’t use activeDoc.InlineShapes.AddOLEObject(), because I don’t want it to be visible as a part of document content.
•  I can’t use Packaging API(Package , PackagePart) like Package package =Package.Open(packagePath, FileMode.Open, FileAccess.Read), because the document is open. Can’t do like mentioned here >> http://openxmldeveloper.org/forums/thread/550.aspx
•  I want to add this file as a part of docx package as mentioned in figure below:
visio.docx
  word
      Media
    image1.pdf  
I want it like the way we load CustomXMLParts using Microsoft Word Object Model.

activeDocument.CustomXMLParts[activeDoc.CustomXMLParts.Count].Load(@"c:\trend micros.JPG");

Awaiting for reply.

Thanks and Regards,
Nasir Khan
Persistent Systems Limited | Microsoft  | nasir_khan@persistent.co.in|+91-20-30563861 |+91-9850554834
My Home Page : network.nature.com/people/nasir_khan/profile

Nasir Khan India |

3/6/2009 6:24:14 PM #

jlundstocholm

Hi Nasir,

I think your main problem is that the document is "open" when you want to add the PDF to the package. There should be no problems using the SDK to add the PDF to the package itself.

If you insist on having the document open when adding the PDF, your only choice (I believe) would be to use the API of the application having opened the document.

But - if you don't want the PDF to be visible - why would you want it in your document ... if it cannot be accessed by the UI?

jlundstocholm Denmark |

3/1/2011 5:31:48 AM #

Lady

I find that readers respond very well to posts that show your own weaknesses, failings and the gaps in your own knowledge rather than those posts where you come across as knowing everything there is to know on a topic. People are attracted to humility and are more likely to respond to it than a post written in a tone of someone who might harshly respond to their comments.

//Link removed
//Jesper

Lady Iceland |

3/1/2011 6:57:14 AM #

orcmid

Wow, this spam is so well-crafted and usable on practically  every blog there is it made me look at the link to the poster.  Funny.

orcmid United States |

3/8/2011 9:32:10 PM #

jlundstocholm

Yes - they are getting quite good at this :-)

It was nice meeting you and your wife last week!

jlundstocholm United States |
