What is a conversion, really?

by jlundstocholm 30. November 2007 19:10

I have been part of some work for the the Danish National Telecom and IT Agency (IT- og Telestyrelsen). They have coordinated quite a few projects around the country to evaluate the usage of ODF and OOXML and possible problems with co-existance of the two document standards. The website for this work is at http://dokumentformater.oio.dk .

The basic setup for the projects and tests has been:

How does a particular department handle the two document formats and possible conversion between them?

Which problems will arise given their current software install-base?

Is it possible to provide some guidance to the departments regarding which specific features of a document format to avoid since they cause problems?

In other words it has been a rather pragmatic approach based on trying to answer the question: "Why do you experience the problems you see?"

Observations

The first thing we realized during the very first day was something quite crucial:

We were not testing compatibility between two formats - instead we were testing quality of converter-tools and compatibility between the specific format and the internal object model the format is loaded into.

Converter-tools

Both OOXML and ODF are rather immature document formats in the market today since neither of them has a broad market penetration as such. Despite the document count on Google, ODF is not widely used and most people still save their work in .DOC-files -even though they have Microsoft Office 2007 installed. This means that conversion between them is also rather immature and this affects the quality of the converters and the results of converting between one format and another. The ODF-Converter project has an extensive list of the differences between the formats themselves and also a list of features currently not supported by the converter and similar lists exist of features not supported by the other tools used. Luckily it seems that the quality of the converters are drastically improving for each incremental new release.

We also noted that a converter is not "just a converter". It lives and breathes on the application it is installed. This was of particular interest when looking at the ODF-Converter Office Add-In and the SUN OOXML-converter. They are both add-ons to existing Office applications but the application behaviour we saw was in principle the same when using OpenOffice.org, IBM Lotus Notes 8 or OpenOffice Novell Edition.

The problem lies in the fact, that every application has an internal object model that determines how a document is persisted in memory in the application. The binary format for Microsoft Office files were essentially a binary dump of the current memory in the application and this basically counts for at lot of applications with binary file formats. Anyway - regardless of how a document is "converted" or "transformed" using another application than the originator, at the end of the day it has to be loaded into the internal object model for the receiving application. This essentially means, that unless there is a 100% air-tight 1-1-mapping of the document format and the internal object model ... information will be lost. This was one part of the problem - the other was the sequence of conversion. Take a look at the sequence listed here:

Sequence 01  Sequence 02
   
load original format Load original format
 ↓
Convert format to new format Load original format into internal object model
 ↓
load new format into internal object model (make changes)
 ↓
(make changes) Persist as new document format
 
Persist as new document format  

It is not entirely evident that this will produce the same output, and we have seen no evidence that any of applications tested did actually have a 1-1 mapping between (any) document format and their internal object model. This also counts for Microsoft Office and its corresponding file types and OpenOffice itself. In short, this was a fact that we had to deal with in our tests.

On a funny note:

The conversion tools we used were all based on XSLT-transformation between the document formats. They are both XML-formats, so it is a good choice. However, we heard rumours that Novell would dump their OOXML-converter (based on XSLT) and develop their own converter based on the internal object model. It will be interesting to see, if it brings greater quality to the converters.

On a lighter note:

We saw in our tests that using the binary Microsoft Office file format as a middle-man when converting from OOXML to ODF (and back) actually produced the best results ... by a long shot. Having this step and using the binary Office file format as a type of "Lingua Franca", was more or less the key to "flaw-less conversion". If you stop and think about it, it makes perfect sense why we saw this. The Microsoft Office Binary file format is well established in the market (not thanks to Microsoft, but to reverse engineering) and the format has been arround for a long time. Basically, all applications can read it and all applications can write it. But why is this interesting? Well, OOXML is an XML-version of the binary Office file format, so since there are "no problems" with converting from the binary format to ODF, it should be technically relatively easy to convert from OOXML to ODF, since OOXML is a binary version of the binary file format.

It is just a matter of time ... and continious improvement of the format converters. 

Comments

Comments are closed