(updated 2008-04-14, added links to external resources)
Now that the ISO vote and the approval of OOXML are done with, it is time to continue the coverage of implementing OOXML as well as ODF. This time it is about OOXML, Microsoft Office 2007 and embedded objects.
As I have said previously, there are always quirks when it comes to implementations of any standard in large applications. I have already covered a few of these regarding mathematical content [0], [1], and object embedding is no different. I should say that a source of inspiration for this article was Stéphane Rodrigues’ article about binary Parts of an OOXML-file (OPC-package).
Now, embedding objects in an OOXML-file is pretty straight-forward: Simply add the object somewhere in the package and make a reference to the location and specify what kind of file you are embedding. This is very similar to how it is done in ODF.
(note: the specific schema-fragments defining how to do this were dealt with and changed at the BRM, so I will not include these until the final version of IS 29500 is released. I will update this article according to the revised spec).
As I have noted earlier, interoperability happens at application-level, so it is worth pondering a bit on how the specification is implemented in the major implementations of it. So let’s see how Microsoft Office acts when embedding objects.
What I did was this:
I used Microsoft Office 2007 to create a text-document and embedded an object in it, in this case an OpenOffice.org Calc spreadsheet. The spreadsheet is also inspired by one of Stéphane Rodrigues’ articles, the infamous “OOXML is defective by design”.
The object is inserted and displayed in the document. When activating the object, I can edit it as if it were in OOo Calc itself. Actually, it is OOo Calc itself. It is invoked using OLE, and as a side-note this shows a cool thing about OLE and other similar object-linking techniques: Microsoft Office 2007 does not know anything about OpenOffice.org, yet it is still able to invoke the application and edit the embedded object.
Ok – now let’s look at the OOXML-file created. In the file document.xml the following fragment is located:
The <v:shape>-element is part of the nasty VML-dependency that luckily was dealt with at the BRM. This will be replaced by DrawingML in the final IS 29500. The <o:OLEObject>-element specifies the type of the embedded object (“opendocument.CalcDocument.1”) and the location of it (“rId5”). There is really nothing platform dependent here in the OOXML-markup.
What is more interesting, though, is looking at the Calc-object after it is embedded. By navigating through the relationship-model of the OPC-package, the embedded object is located.
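To make the navigation step concrete, here is a minimal sketch in Python of how a relationship id such as “rId5” can be resolved to a part inside the package. It assumes the usual OPC layout, where the relationships of word/document.xml are stored in word/_rels/document.xml.rels. The function names, the demo package, the part name embeddings/oleObject1.bin and the placeholder relationship type are all illustrative assumptions and are not taken from the file examined in this article.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by OPC relationship parts (the *.rels files in the package).
REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def resolve_relationship(package: zipfile.ZipFile, source_part: str, rel_id: str) -> str:
    """Return the name of the package part that rel_id of source_part points to."""
    folder, name = posixpath.split(source_part)
    # Relationships of "word/document.xml" live in "word/_rels/document.xml.rels".
    rels_part = posixpath.join(folder, "_rels", name + ".rels")
    root = ET.fromstring(package.read(rels_part))
    for rel in root.findall(f"{{{REL_NS}}}Relationship"):
        if rel.get("Id") != rel_id:
            continue
        if rel.get("TargetMode") == "External":
            raise ValueError(f"{rel_id} points outside the package")
        target = rel.get("Target", "")
        if target.startswith("/"):
            return target.lstrip("/")  # absolute part name
        # Relative targets are resolved against the folder of the source part.
        return posixpath.normpath(posixpath.join(folder, target))
    raise KeyError(rel_id)


def build_demo_package() -> bytes:
    """Build a tiny stand-in package in memory; every name here is hypothetical."""
    rels = (
        f'<Relationships xmlns="{REL_NS}">'
        '<Relationship Id="rId5" Type="urn:example:oleObject" '
        'Target="embeddings/oleObject1.bin"/>'
        "</Relationships>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("word/document.xml", "<document/>")
        package.writestr("word/_rels/document.xml.rels", rels)
        package.writestr("word/embeddings/oleObject1.bin", b"placeholder")
    return buffer.getvalue()


with zipfile.ZipFile(io.BytesIO(build_demo_package())) as demo:
    embedded_part = resolve_relationship(demo, "word/document.xml", "rId5")
    print(embedded_part, len(demo.read(embedded_part)), "bytes")
```

With a real document, the same function would be called on zipfile.ZipFile("some-document.docx") instead of the in-memory demo package.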
One might think that this file was simply the Calc-file renamed, but sadly this is not so. This file is actually the Calc-file wrapped in an OLE2 Compound file (“CF”). The CF-file is basically a stream wrapper which allows a number of streams to be persisted in a file as well as information about these streams. Using one of the many CF-viewers you can get the data of the wrapped file itself as well as the persisted information of it, here “com.sun.star.comp.Calc.SpreadsheetDocument _ Embedded Object _ opendocument.CalcDocument.1”.
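A quick way to tell whether an extracted part is a CF-wrapper or the plain Calc-file is to look at its first bytes. The sketch below is only an illustration: it relies on the well-known 8-byte signature that opens a compound file and on the “PK” signature of ZIP-based files such as ODF packages. The header offsets reflect my reading of the published compound file format, and the function names are my own.

```python
import struct

# Every OLE2 compound file starts with these eight bytes.
OLE2_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")
# ZIP-based files (such as an unwrapped ODF spreadsheet) start with a local file header.
ZIP_SIGNATURE = b"PK\x03\x04"


def sniff_container(data: bytes) -> str:
    """Classify a blob by its leading signature bytes."""
    if data.startswith(OLE2_SIGNATURE):
        return "ole2-compound-file"
    if data.startswith(ZIP_SIGNATURE):
        return "zip"
    return "unknown"


def read_cf_header(data: bytes) -> dict:
    """Read a few fixed header fields of a compound file.

    Assumed layout: 8-byte signature, 16-byte CLSID, then four little-endian
    16-bit fields (minor version, major version, byte order, sector shift).
    """
    if len(data) < 32 or not data.startswith(OLE2_SIGNATURE):
        raise ValueError("not an OLE2 compound file")
    minor, major, byte_order, sector_shift = struct.unpack_from("<HHHH", data, 24)
    return {
        "major_version": major,
        "minor_version": minor,
        "byte_order": byte_order,
        "sector_size": 1 << sector_shift,
    }


# A hand-made header for illustration only, not a complete compound file.
demo_header = OLE2_SIGNATURE + bytes(16) + struct.pack("<HHHH", 0x3E, 3, 0xFFFE, 9)
print(sniff_container(demo_header), read_cf_header(demo_header))
```

Getting at the wrapped streams themselves requires walking the sector tables, which is what the CF-viewers mentioned above do.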
Technically this is really not a big deal – there are well-known ways to manipulate these files on all platforms and most programming languages and extracting the required data should really be a no-brainer. OpenOffice.org is licensed under LGPL, so you can use the source-code from this to figure out how to do it on the platforms supported by OpenOffice.org. It is also pretty evident why Microsoft Office 2007 works this way. Microsoft Office 2007 is the latest incarnation of the Microsoft Office Suite – a suite that has depended on this file format since at least 1999 … and of course on OLE itself as well. So if you want to implement a document consumer, this is simply something to be aware of when consuming OOXML-files.
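For a document consumer, being aware of this can be as simple as checking each part of the package before handing it to the code that expects the native file. The following sketch assumes only that the package is a ZIP file and that a wrapped part begins with the compound file signature. The function name and the demo part names are hypothetical.

```python
import io
import zipfile

# Leading bytes of an OLE2 compound file.
OLE2_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")


def find_cf_wrapped_parts(package_bytes: bytes) -> list[str]:
    """Return the names of all parts in a ZIP package that start with the CF signature."""
    wrapped = []
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        for info in package.infolist():
            if info.is_dir():
                continue
            with package.open(info) as part:
                if part.read(len(OLE2_SIGNATURE)) == OLE2_SIGNATURE:
                    wrapped.append(info.filename)
    return wrapped


# Demo package built in memory; the part names are made up for the example.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as package:
    package.writestr("word/document.xml", "<document/>")
    package.writestr("word/embeddings/oleObject1.bin", OLE2_SIGNATURE + b"payload")
print(find_cf_wrapped_parts(demo.getvalue()))
```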
From the perspective of a developer, however, this is really annoying. I would definitely opt for Microsoft Office 2007 embedding the objects simply as the objects they are – and not wrapping them in a CF-wrapper. This is how it is done in OpenOffice.org. Granted, this suite does other weir(d) things like renaming the files and not being entirely clear how to embed all object types, but the objects are embedded as they are (unless they are OpenDocument objects). This is a benefit to me as a developer when examining OOXML-files, because I can simply extract the object in question from the document package and verify the file.
So this might be the first post-vote modification to IS 29500:
When embedding objects an application shall not modify or wrap the embedded object in any way before embedding it in the package. When a document consumer encounters an embedded object, this shall not be converted to another object type without knowledge-based confirmation by the user.
This (or similar wording in standard-lingo) would prevent Microsoft Office from wrapping objects in CF-wrappers, but it would also prevent applications like OpenOffice.org on SUSE from converting embedded Excel-objects to Calc-spreadsheets. FYI, such conversion kills interop too.
A final request: Microsoft, please, as you must already be implementing the changes from the BRM for Office 2007, would you be so kind as to make this change to the application as well? It should really be a no-brainer, and if there should be any requirements in your code for the CF-files, feel free to load the objects, wrap them in an in-memory CF-file and take it from there.