a 'mooh' point

clearly an IBM drone

And we have lift-off (white noise)

It is only a matter of minutes until the BRM starts here in Geneva. It will be a tough week with long meetings during the day and preparation in the evening (good bye Red Light District!). So to be able to concentrate fully on the task at hand, I will shut down this blog for the week ... but don't worry - I'll be back soon.


Santa Claus is coming to town

Now there are only a few days until I jump on a plane and head south to Geneva, Switzerland for the ISO/IEC SC34 Ballot Resolution Group Meeting, amongst laymen known primarily as "The BRM meeting". I cannot make up my mind whether I am excited or worried about the outcome of the meeting ... thinking primarily about the enormous workload awaiting us down there. We will have to work through about 1000 unique dispositions of comments from ISO/IEC editor Rex - scattered over about 3500 comments in total. It's a daunting task indeed - not least for BRM convenor Alex Brown from BSI UK. Adding to this workload is the small detail that there will be 120 delegates dealing with it. It truly is breath-taking and I cannot help but feel like a mountain-climber standing at the foot of Mount Everest waiting to start the journey upwards. I expect the days to consist of work in the BRM meeting during normal work hours and work in the evening at the hotel, sifting through the results of the day and preparing for the next.

I am also thinking quite a bit about what will actually take place in Geneva at the meetings. As I understand the ISO rules (and please note, I have been wrong before), after the BRM is done, the standard to approve is the original submission with the changes made in Geneva. In other words - if not a single disposition can be agreed upon, the standard stands as it did when it was submitted in Spring 2007. I really hope that the delegates opposing OOXML do not try to paralyze the BRM with a massive DoS-attack on the process. As Alex Brown points out, it is the responsibility of the Heads of Delegation (HoD) that this does not happen, and judging by what the Danish HoD has told us, it is clear to me that they actually have a lot of future credibility in standards work invested in this. If they are not able to perform in an orderly manner at the BRM, their influence in all the other work they are doing will be diminished. I hope this will keep the lid on most of the fanatic out-bursts.

I am also looking forward to meeting some of the people I met in Kyoto in December 2007. Of course it is always nice to talk to people you agree with, but I sometimes get a bit bored with the "echo-chamber"-feeling of spending too much time with people of your own opinion. So I am even more looking forward to conversations with the delegates (and, yes, even the people of Open Forum Europe, who I have been told will be cheering us along in the corridors of the meeting) who are a bit more on the negative side of DIS 29500. It will be interesting to see what they think.

OOh ... and on Saturday I will go see Dinosaurs

Wanna join? 

 

Interoperability - between what?

What is interoperability, really?

Well, when it comes to document formats, some people seem to think that interoperability is the ability to transform one format into another: that high-fidelity interoperability can only be achieved when it is possible to perform a complete translation/conversion of format X to format Y.

The basic problem with this premise is that if you were able to do this conversion, it would be the same as being able to make a 1-1 mapping between the functionality and features of format X and format Y (and vice versa). However - this effectively means that format X is actually just a permutation of format Y ... making format X and format Y the same format (pick up your favorite book on mathematical topology to see the details).
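To make the argument a bit more concrete, here is a minimal Python sketch of the round-trip test. The feature names and both mappings are made up for illustration; they are not taken from either specification.

```python
def survives_round_trip(features, x_to_y, y_to_x):
    """Return the features that come back unchanged after X -> Y -> X."""
    return {f for f in features if y_to_x.get(x_to_y.get(f)) == f}


# Hypothetical feature sets: format X has one feature that format Y lacks.
X_FEATURES = {"bold", "italic", "inline-math-change-tracking"}
X_TO_Y = {"bold": "bold", "italic": "italic"}  # nothing to map the third feature to
Y_TO_X = {"bold": "bold", "italic": "italic"}

lost = X_FEATURES - survives_round_trip(X_FEATURES, X_TO_Y, Y_TO_X)
print(sorted(lost))  # ['inline-math-change-tracking']
```

As soon as one side has a feature without a counterpart, the mapping is no longer 1-1 and something is lost on the way back.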

When it comes to ODF and OOXML, the case is pretty clear - the two formats are not the same. Sure - they can both define bold text, but there are quite a few differences between the formats. A list of some of them can be found at the ODF-Converter website. I think that the list is the best argument for not being able to do a complete conversion of ODF to OOXML (and back). This was also one of the conclusions of the Fraunhofer/DIN-work in Germany, where they concluded that a full 1-1 mapping between the two formats could not be done.

The key question here is: Is interoperability diminished by this fact?

If you ask Rob's posse, they will almost certainly say "Yes". They will say something like "Microsoft chose not to make OOXML interoperable with the existing ISO-standard ODF and therefore OOXML is a blow to interoperability".

If you ask me, I will say "No". I will say no because the term "interoperability" has been hijacked by the anti-OOXML-lobby in much the same way the SVG-namespace was hijacked by ODF TC. I will say "No" because interoperability means something radically different. The meaning is not rocket science, really ... and usually most people agree with the basic definition of interoperability. A few of those definitions are:

Computer Dictionary Online:

http://www.computer-dictionary-online.org/interoperability.htm?q=interoperability

The ability of software and hardware on multiple machines from multiple vendors to communicate.

IEEE: 

http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?tp=&isnumber=4683&arnumber=182763&punumber=2267

the ability of two or more systems or components to exchange information and to use the information that has been exchanged

US e-Government Act of 2002:

http://frwebgate.access.gpo.gov/cgi-bin/getdoc.cgi?dbname=107_cong_public_laws&docid=f:publ347.107.pdf

ability of different operating and software systems, applications, and services to communicate and exchange data in an accurate, effective, and consistent manner.

If you also look at the enormous list from Google you will see, that none of the definitions talk about the ability to convert formats. Instead they talk about communication between machines, platforms and networks. This is very close to my definition of interoperability when it comes to document formats.

The interoperability gained by using a specific document format is based on the possibility of implementing the format on any kind of platform, in any kind of software, using any kind of operating system. It is based on how concise and clear the language of the specification of the format is, and it depends on how well thought out the specification is.

It has nothing, nothing, nothing to do with the possibility of converting the format to any other format. 

Word of recognition from an unexpected side

Today - or was it yesterday? - Patrick Durusau issued an open letter regarding the standardization of OOXML. It is an interesting read - especially for those of us who have worked endless hours in NSBs processing the dispositions of comments from ISO/IEC editor Rex Jaeschke. I will not dig too much into the details of the statement, since I am sure others will do so, just quietly note that it is nice once in a while to be appreciated and not only picked at because of our "lack of qualifications" and accused of being angle-grapping, bribed, paid-for puppets only acting by the will of Microsoft.

Thank you, Patrick!


I will only quote this:

The OpenXML project has made a large amount of progress in terms of the openness of its development. Objections that do not recognize that are focusing on what they want to see and not what is actually happening with OpenXML.

Ooh - and one prediction: I think the anti-OOXML-lobby will try to drop this like a hot potato. The Pro-choice side will naturally salute this - and the Pro-ODF side will wait out the storm, quietly mumbling "Nothing to see here, please pass along".

Yes, some of them might even use some of the skills they learned in the third part of the course they took, Hypocrisy 101.

"Talk is silver, but silence is gold"

Do your math - OOXML and OMML (Updated 2008-02-12)

As I promised in my latest article about ODF and MathML, I have worked a bit with the ECMA-equivalents of ODF and MathML: OOXML and OMML (Office Math ML).

A bit of introduction is probably a good idea:

In OOXML, mathematical content is structured using the internal markup language Office Math ML, or OMML for short. OMML is closely tied to the structure of WordprocessingML and the look-and-feel is very similar. In contrast to the ODF-way, OMML is usually inserted inline in the WordprocessingML, whereas in ODF it is kept in a separate part of the package.

Ok - now that that is done with - let's get on with the good stuff!

As in my previous article, I'll work with the same base equation



Now, as I wrote in the other article, learning MathML is like learning a new (programming)-language, and I can tell you, it is no different with OMML. MathML arranges the mathematical elements by position whereas OMML arranges the mathematical elements by their explicit meaning, so a fraction is created in MathML as (simplified)

<math:mfrac>
  <math:mi>π</math:mi>
  <math:mn>4</math:mn>
</math:mfrac>

and in OMML it is created as (simplified)

<m:f>
  <m:num>
    <m:r>π</m:r>
  </m:num>
  <m:den>
    <m:r>4</m:r>
  </m:den>
</m:f>

So when dealing with MathML and e.g. fractions, we look at a fraction with "something at the top and something at the bottom". When dealing with OMML, we deal with "numerators" and "denominators". It is rather clear to me that any skills learned in MathML are not directly applicable to OMML - and vice versa. It took me about the same amount of time to "get" MathML as it did to "get" OMML. In both cases, I had not worked with the specific ML before. It has taken me about a day to research and write each article.
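The difference between "by position" and "by meaning" can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function name is mine, the input uses the full run/text form (m:r containing m:t), and the rule "digits become mn, everything else mi" is a crude simplification of what a real converter does.

```python
import xml.etree.ElementTree as ET

M_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math"
MML_NS = "http://www.w3.org/1998/Math/MathML"
ET.register_namespace("math", MML_NS)


def omml_fraction_to_mathml(omml_fraction: str) -> str:
    """Map the named parts of <m:f> onto the positional children of <mfrac>."""
    fraction = ET.fromstring(omml_fraction)
    mfrac = ET.Element(f"{{{MML_NS}}}mfrac")
    # OMML names its parts; MathML only knows "first child" and "second child".
    for part in ("num", "den"):
        text = "".join(fraction.find(f"{{{M_NS}}}{part}").itertext()).strip()
        tag = "mn" if text.isdigit() else "mi"  # simplification, see lead-in
        ET.SubElement(mfrac, f"{{{MML_NS}}}{tag}").text = text
    return ET.tostring(mfrac, encoding="unicode")


SAMPLE = (
    f'<m:f xmlns:m="{M_NS}">'
    "<m:num><m:r><m:t>π</m:t></m:r></m:num>"
    "<m:den><m:r><m:t>4</m:t></m:r></m:den>"
    "</m:f>"
)
print(omml_fraction_to_mathml(SAMPLE))
```

The order of the loop is what carries the meaning over: the numerator has to be emitted first, because in MathML the position is the only thing that says "this is the top".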

Anyway - back to the plot.

As always I work with my friend, "the minimal OOXML-file". It is an OOXML-file stripped of all the junk and cut down to the bare minimum - not even a single unused namespace declaration is left behind. You can see the minimal file here: Minimal OOXML.docx (1,16 kb).

So my task was a two-step-task: Since OOXML is rather new there is not that much information about OMML out there. So as a first step I created a sample equation using Word 2007 to get a feeling of what it's all about. Then I found Part 4 of the OOXML-spec, located section 7 and started to put the OMML together. The OMML I ended with was this:

<m:oMathPara>
  <m:oMath>
    <m:r>
      <w:rPr>
        <w:rFonts w:ascii="Cambria Math" w:hAnsi="Cambria Math"/>
      </w:rPr>
      <m:t>cos</m:t>
    </m:r>
    <m:f>
      <m:num>
        <m:r>
          <w:rPr>
            <w:rFonts w:ascii="Cambria Math" w:hAnsi="Cambria Math"/>
          </w:rPr>
          <m:t>π</m:t>
        </m:r>
      </m:num>
      <m:den>
        <m:r>
          <m:t>4</m:t>
        </m:r>
      </m:den>
    </m:f>
    <m:r>
      <m:t>=</m:t>
    </m:r>
    <m:f>
      <m:num>
        <m:rad>
          <m:radPr>
          </m:radPr>
          <m:deg/>
          <m:e>
            <m:r>
              <m:t>2</m:t>
            </m:r>
          </m:e>
        </m:rad>
      </m:num>
      <m:den>
        <m:r>
          <m:t>2</m:t>
        </m:r>
      </m:den>
    </m:f>
  </m:oMath>
</m:oMathPara>

I bet you are now thinking what I was thinking: what the f***? That's a lot of markup! Well, the reason why there is so much markup is that each piece of text/data in the equation is encapsulated in a "run"-element that enables additional styling. If all this additional markup including other property-markup is removed, the result is this:

<m:oMathPara>
  <m:oMath>
    cos
    <m:f>
      <m:num>π</m:num>
      <m:den>4</m:den>
    </m:f>
    =
    <m:f>
      <m:num>
        <m:rad>
          <m:e>2</m:e>
        </m:rad>
      </m:num>
      <m:den>2</m:den>
    </m:f>
  </m:oMath>
</m:oMathPara>

Ain't that purdy?
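The stripping itself can be done mechanically. Below is a minimal sketch in Python, assuming that "additional markup" means the m:r runs (replaced by their text), every element whose name ends in "Pr", and elements left empty afterwards such as m:deg. The function name and the shortened sample are mine.

```python
import xml.etree.ElementTree as ET

M_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math"
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("m", M_NS)


def strip_runs(elem):
    """Copy elem, replacing runs by their text and dropping property markup."""
    new = ET.Element(elem.tag)
    last = None  # last child element appended to the copy
    for child in elem:
        local = child.tag.rsplit("}", 1)[-1]
        if local.endswith("Pr"):
            continue  # property markup such as m:radPr
        if local == "r":
            text = "".join(t.text or "" for t in child.iter(f"{{{M_NS}}}t"))
            if last is None:
                new.text = (new.text or "") + text
            else:
                last.tail = (last.tail or "") + text
            continue
        stripped = strip_runs(child)
        if len(stripped) == 0 and not stripped.text:
            continue  # drop elements that ended up empty, e.g. <m:deg/>
        new.append(stripped)
        last = stripped
    return new


SAMPLE = f"""<m:oMath xmlns:m="{M_NS}" xmlns:w="{W_NS}">
  <m:r><w:rPr><w:rFonts w:ascii="Cambria Math" w:hAnsi="Cambria Math"/></w:rPr><m:t>cos</m:t></m:r>
  <m:f>
    <m:num><m:r><m:t>π</m:t></m:r></m:num>
    <m:den><m:r><m:t>4</m:t></m:r></m:den>
  </m:f>
</m:oMath>"""

skeleton = strip_runs(ET.fromstring(SAMPLE))
print(ET.tostring(skeleton, encoding="unicode"))
```

Note that the stripped result is for reading only. It is not valid OMML, since text in OMML has to live inside runs.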

The OOXML-file with the equation is available here: minimal ooxml with math.docx (1,25 kb). It displays like this in Microsoft Office 2007:

Why not just use MathML?

Before I go into the details with converting from MathML to OMML, I think it is appropriate to pause and look at how MathML and OMML differ from each other. As I noted above there is quite a lot of "overhead" in OMML with everything being encapsulated in "runs". But there is a reason for this. The overhead enables us to do a couple of things that we cannot do with MathML.

Everything fits

You can put virtually everything into an OMML-formula that you can put into a normal WordprocessingML-fragment. As Murray Sargent puts it:

Word needs to allow users to embed arbitrary span-level material (basically anything you can put into a Word paragraph) in math zones and MathML is geared toward allowing only math in math zones. A subsidiary consideration is the desire to have an XML that corresponds closely to the internal format, aiding performance and offering readily achievable robustness. Since both MathML and OMML are XMLs, XSLTs can (and have) been created to convert one into the other. So it seems you can have your cake and eat it too. Thank you XML!

MathML allows some styling of the individual text fragments in the equations, but that's basically it.

WordprocessingML look-and-feel is preserved

To me it is really nice to work with markup for equations that is similar to the markup surrounding it. If I were to use MathML inline instead of OMML, the markup would be completely different from the markup around it. You can say that using MathML enables you to reuse any MathML-skills you might have in advance. Similarly you can say that using OMML for equations enables you to reuse the skills you have from working with WordprocessingML. It's kind of a "give-and-take"-situation.

Revision-control (change-tracking) is possible

Having the overhead enables change-tracking on the same granular level as with your regular text. You can track changes in your equations on a character-by-character basis. In Word 2007 it looks like this when I make a modification to the equation (multiply the second fraction with "2" and remove the cosine-function from the first fraction).

 

 

The markup enabling this is here (for removing the cosine function, where "w:del" means "delete"):

<w:del w:id="0" w:author=" Jesper Lund Stocholm" w:date="2008-01-30T10:41:00Z">
  <m:r>
    <w:rPr>
      <w:rFonts w:ascii="Cambria Math" w:hAnsi="Cambria Math"/>
    </w:rPr>
    <m:t>cos</m:t>
  </m:r>
</w:del>

This is not at all possible when using MathML out-of-the-box. You cannot merge the MathML with other markup like this, and if you use MathML as it is done in ODF (i.e. not "inline") it is simply impossible (at least as far as I can see). MathML in ODF is treated as an external object, which means that it is encapsulated in an OpenDocument Draw frame. The markup for one of the files I used in the other article is like this:

<text:p text:style-name="Standard">
 <draw:frame
   draw:style-name="fr1"
   draw:name="Objekt1"
   text:anchor-type="as-char"
   svg:width="2.418cm"
   svg:height="1.034cm"
   draw:z-index="0"
 >
  <draw:object
    xlink:href="./MathML"
    xlink:type="simple"
    xlink:show="embed"
    xlink:actuate="onLoad"
  />
  <draw:image
    xlink:href="./ObjectReplacements/MathML"
    xlink:type="simple"
    xlink:show="embed"
    xlink:actuate="onLoad"
  />
 </draw:frame>
</text:p>

If I wanted to change some text like "Display equation below"  to "Disrply equation below" (add an 'r' and delete an 'a') in ODT, it would look something like this:

<text:p>
  Dis<text:change-start text:change-id="ct102825880"/>
  r<text:change-end text:change-id="ct102825880"/>
  pl<text:change text:change-id="ct102844952"/>
  y equation below
</text:p>

So registration of the changes is - as with OOXML - merged into the text being modified. I think you could mark the whole equation as "modified" in ODF by putting a <text:change-start>-element around the complete <draw:object>-element, but I am not sure it would work. Also, OpenOffice.org doesn't seem to register changes to MathML-zones at all. Using OpenOffice.org it looks like this

 

(I changed the denominator of the first fraction to "54") 

 

I cannot say that there are (or are not) other areas where MathML just doesn't cut it - these were just a couple of those that I have experienced myself. I do believe, though, that the examples above warrant the simple question:

Why the hell did OASIS ODF TC decide to use MathML in the first place?

Interoperability

Interoperability is clearly what the young kids want these days - so let's see what we can do with mathematical content. MathML and OMML are clearly two different markup languages, but is it possible to convert between them? Fortunately it is. Microsoft Office 2007 allows copy/paste of MathML into OMML-equations and it can even export OMML to MathML. Luckily for us the logic around this is not embedded into some fancy place in Microsoft Office 2007 - it is done using simple XSLT-transformations. They have made the stylesheets OMML2MML.xsl and MML2OMML.xsl, and if you apply these to either your OMML or MathML, it is translated to the other. Just for the fun of it I tried to convert the OMML-version of the equation to MathML. All I did was to find the OMML2MML.XSL and insert a single line in the XML-file document.xml

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<?xml-stylesheet type="text/xsl" href="OMML2MML.XSL"?>
<w:document
  xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math"
  xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
  >
  <w:body>
    <w:p>
      <m:oMathPara>
        <m:oMath>
          <m:r>
            <w:rPr>
              <w:rFonts w:ascii="Cambria Math" w:hAnsi="Cambria Math"/>
            </w:rPr>
            <m:t>cos</m:t>
          </m:r>
...

(and then I processed the file using my favorite XSLT-translator)
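If you would rather not edit document.xml by hand, inserting that single line can be scripted. Here is a minimal sketch using Python's standard library. The function name is mine, and the sample document is cut down to an empty body. The transformation itself still needs an XSLT processor of your choice, which the standard library does not provide.

```python
from xml.dom import minidom


def add_stylesheet_pi(document_xml: bytes, href: str = "OMML2MML.XSL") -> str:
    """Insert an xml-stylesheet processing instruction before the root element."""
    dom = minidom.parseString(document_xml)
    pi = dom.createProcessingInstruction(
        "xml-stylesheet", f'type="text/xsl" href="{href}"'
    )
    dom.insertBefore(pi, dom.documentElement)
    return dom.toxml()


SAMPLE = (
    b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    b'<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    b"<w:body/></w:document>"
)
result = add_stylesheet_pi(SAMPLE)
print(result)
```

The href is relative, so the stylesheet has to sit next to the XML file when the processor picks it up.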

I'm sure - if you are a "technical" person - you have found yourself using/writing some code and just before you press "Compile" or "Run" you think: "This is sooo not gonna work". This was one of those situations for me - but you know what, it actually worked on the first try. The MathML generated is this

<?xml version="1.0" encoding="utf-8"?>
<mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML">
  <mml:mi mathvariant="italic">cos</mml:mi>
  <mml:mfrac>
    <mml:mrow>
      <mml:mi>π</mml:mi>
    </mml:mrow>
    <mml:mrow>
      <mml:mn>4</mml:mn>
    </mml:mrow>
  </mml:mfrac>
  <mml:mo>=</mml:mo>
  <mml:mfrac>
    <mml:mrow>
      <mml:mroot>
        <mml:mrow>
          <mml:mn>2</mml:mn>
        </mml:mrow>
        <mml:mrow />
      </mml:mroot>
    </mml:mrow>
    <mml:mrow>
      <mml:mn>2</mml:mn>
    </mml:mrow>
  </mml:mfrac>
</mml:math>

... and it validates as well (using Amaya and changing the XML-file from a UTF-16 file to UTF-8)

Et voilà

Now, wouldn't it be cool if the MathML generated from the OMML could be used in an ODT-document? You know what ... it can! I took the MathML above and inserted it into the MathML-zone of the ODF-package of one of the documents I made for the ODF/MathML-article. The file is available here: minimal-mathml-omml-inject.odt (1,31 kb).
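Since an ODF-package is a ZIP-archive, the injection can also be scripted. The sketch below rewrites a package with one member replaced. I am assuming that the embedded formula lives in the member "MathML/content.xml" (derived from the xlink:href="./MathML" shown earlier); check the manifest of your own file before relying on that. The function name is mine.

```python
import io
import zipfile


def replace_zip_member(package: bytes, member: str, new_data: bytes) -> bytes:
    """Return a copy of the ZIP package with one member's content replaced."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(package)) as src, zipfile.ZipFile(out, "w") as dst:
        for info in src.infolist():
            data = new_data if info.filename == member else src.read(info.filename)
            # Reusing the ZipInfo keeps the entry order and compression type,
            # so "mimetype" stays first and uncompressed as ODF expects.
            dst.writestr(info, data)
    return out.getvalue()


# Usage (paths and member name are assumptions, see above):
# odt = open("minimal.odt", "rb").read()
# mathml = open("generated-mathml.xml", "rb").read()
# open("injected.odt", "wb").write(
#     replace_zip_member(odt, "MathML/content.xml", mathml))
```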

The result of opening the file using OpenOffice.org:

In the words of Murray Sargent, I guess you can have your cake and eat it too after all.


Update:

When writing my post about where to get help for ODF-development I suddenly remembered that I missed a part of this article: "The quirks". Because - naturally there are quirks with using OMML with Microsoft Office 2007 ... just as there were with MathML and OpenOffice.org.

Now, if you take another look at the OMML/XML-fragment I created, there were two parts I really couldn't figure out a way to remove:

<m:oMathPara>
  <m:oMath>
    <m:r>
      <w:rPr>
        <w:rFonts w:ascii="Cambria Math" w:hAnsi="Cambria Math"/>
      </w:rPr>
      <m:t>cos</m:t>
    </m:r>
    <m:f>
      <m:num>
        <m:r>
          <w:rPr>
            <w:rFonts w:ascii="Cambria Math" w:hAnsi="Cambria Math"/>
          </w:rPr>
          <m:t>π</m:t>
        </m:r>
      </m:num>

Now, the <w:rPr>-elements should have absolutely nothing to do with the content of the <m:t>-element - or more correctly, the visibility of the text in the <m:t>-element should not depend on the existence of a <w:rPr>-element. But if the two <w:rPr>-sections are omitted, the "cos"-text as well as the π-sign are not displayed. I really have no idea why this is so; if you do, please let me know. Maybe one of the Microsoft Office 2007-Math guys could step in here?

ECMA has sent out the last responses

Yesterday was the day when the last responses from ECMA were made available to the national bodies around the world. With that, ECMA has responded to all of the roughly 3500 comments that came in during the processing of DIS 29500 in the summer/autumn of 2007.

While working on the standard and following the discussions about it over the summer, I could not help thinking that a great many of the comments were pure nonsense or at best irrelevant. They seemed to be made according to the motto "if I deliberately try to misunderstand this - where is that easiest to do?" (e.g. OLE). Of course there were many good comments, but many of them were factually rubbish.

But I have to admit, now that I sit and look at the result of the processing of the comments, that the total amount of comments has resulted in a standard that in many ways is better than it was before. The standard has quite simply become more precisely worded and generally easier to use. That is clearly worth an appreciative nod to all the people who (whether they are for or against OOXML) have combed through the proposed standard. Thanks to all of you! It is worth stressing that the standard has not been completely redone - rather, it has been improved in a number of areas where it needed polishing. The architecture itself is the same, so any energy you have spent on using the existing ECMA-376 is certainly not wasted. The points where I think the biggest improvements have come are:

  • There is no longer any requirement to use VML in new documents
  • Country codes must now be given as specified in RFC-4646
  • It is clearer that OOXML must use existing, well-tested hash algorithms such as those specified in FIPS-180
  • The conformance requirements have become clearer
  • The famous "leap year bug" is now marked as deprecated
  • It is possible to use dates before 1900
  • The formula specifications for spreadsheets are now described in EBNF notation

And what about the rest of the many comments, such as "Compatibility-elements"? Well - I only mentioned the parts that I think are the most important (and I have of course probably forgotten some other important ones).
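For readers who have not met it: the "leap year bug" refers to the 1900 date system for spreadsheet serial dates, which counts a 29 February 1900 that never existed. A minimal Python sketch of what a consumer has to do about it is shown below. The function name is mine, and it only covers whole-day serials in the 1900 date system.

```python
from datetime import date, timedelta


def serial_to_date(serial: int) -> date:
    """Convert a 1900-date-system serial (1 = 1900-01-01) to a real date."""
    if serial < 1:
        raise ValueError("serials below 1 are not covered by this sketch")
    if serial == 60:
        raise ValueError("serial 60 is the non-existent 29 February 1900")
    # Every serial after the phantom day is one too high.
    offset = serial - 1 if serial < 60 else serial - 2
    return date(1900, 1, 1) + timedelta(days=offset)


print(serial_to_date(59))     # 1900-02-28
print(serial_to_date(61))     # 1900-03-01
print(serial_to_date(39448))  # 2008-01-01
```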


Yet another bucket of responses from ECMA

ECMA is now ready with yet another bucket of responses to the various countries in connection with the work on DIS 29500. From their press release it can be seen that ECMA has now responded to 92% of the 3500 comments received, and it looks like they will manage to get all responses done before the deadline on Monday, January 14, 2008. Of the responses to the Danish comments, only very few now remain to be processed, and it will be exciting to see what ECMA answers on the last points.

One thing I have had trouble figuring out is whether ECMA is allowed by ISO/IEC to publish a new consolidated standard with all corrections included. Do any of you readers have this information? As far as I read the JTC1 directives, they may not publish the individual dispositions themselves, nor the comments from the countries, so the only way of getting the responses to the comments out is presumably to publish the final, full report. Personally I do not think ECMA will publish the full, revised standard until after the BRM in February - but regardless of the outcome it is a bit of a choice between two evils.

I must honestly admit that I have enjoyed the peace to work in the last months after September 2, 2007, and especially after ECMA began circulating the responses to the individual countries. Clearly there has not been as much debate - actually much less than I had expected - but the individual countries are also in a completely different situation. In the first part of the 5-month ballot period it was in my eyes a clear advantage that OOXML was discussed so broadly, because it uncovered a long list of shortcomings and impractical aspects of the standard. I doubt that the individual countries could have delivered the same work had it not been for GrokDoc, IBM, Andy and others who combed through the OOXML-spec for errors. There was an imminent risk that the countries would simply have voted "abstain" because they could not understand the spec - just as they did with ODF in December 2006.

Fortunately they did not, and the situation now is that the individual countries must see whether the responses from ECMA to the individual comments are good enough. That is of course work of a completely different nature, and it is my view that here we do not need the key people from the other side of the river.

But - it will be exciting to see the final result of ECMA TC45's work.


 

What's up with OLE?

A few weeks back I wrote an article about how Microsoft Office 2007 dealt with password-protection of an OPC-package, since this feature is not a part of the OOXML-specification. The answer I found was that Microsoft Office 2007 persists the password-protected file as an OLE2 Compound File ... more commonly known as an "OLE-file". I also concluded that using OLE2 Compound Files is not a problem - and certainly not an issue regarding OOXML.

Now - the whole topic around OLE has been at the front row of the worldwide debates regarding OOXML. My personal opinion is that the people jumping up and down screaming about problems with OLE ... really haven't understood what OLE is.

So let me start by making a small recap' of what it is really all about.

... there is OLE and then there is OLE 

First of all:

there is "OLE" and then there is ... "OLE"

... or put in another way:

there is the "OLE-technology" and then there is the "OLE-file"

or in a third, more correct, way:

there is the "OLE application technology"  and then there are "Compound Files".

The former is the technology that - on the Windows platform - enables a program to use the UI of another program ... without launching the entire application itself. I mostly use this when editing MS Visio-documents in Word, but another usage is embedding an Excel spreadsheet in an MS Word document. The OLE-technology itself is a tool on the Windows-platform that all applications can - and do - use to enable "utilizing other applications in their own applications". It is important to understand here that there is (today) nothing really revolutionary about OLE. Another similar technology on the Windows-platform is DDE, and on the Linux-platform it could be KParts and Bonobo. These technologies simply enable one program to communicate with another (simply put).

But what about these OLE-files?

Well, Compound Files are actually not dependent on OLE-technology. Or put in another way: you don't need OLE-technology to read and use the contents of a Compound File. Compound Files are just files. A Compound File is a collection of persisted streams - actually much like a ZIP-archive. Most commonly it is used because it brings the ability to "utilize a file system within a file". Of course you will need to know how to use the contents of the file, be it created by OpenOffice, Corel Draw, Adobe Acrobat or any application that might store its files using Compound Files. But this is separate from being able to read and write to the contents of a Compound File.
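That a Compound File is "just a file" is easy to see from its first bytes. The sketch below tells a Compound File from a ZIP-based package (such as an unprotected OOXML- or ODF-file) on any platform, with no OLE-technology involved. The two signatures are the well-known magic numbers of the formats; the function name is mine.

```python
# First eight bytes of every Compound File.
CFB_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
# Local file header signature at the start of an ordinary ZIP-archive.
ZIP_MAGIC = b"PK\x03\x04"


def sniff_container(data: bytes) -> str:
    """Classify a file by its leading bytes only."""
    if data.startswith(CFB_MAGIC):
        return "compound-file"
    if data.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"


# Usage (the file name is hypothetical):
# with open("protected.docx", "rb") as handle:
#     print(sniff_container(handle.read(8)))
```

Run against a password-protected .docx from Microsoft Office 2007 this should report a Compound File, in line with what I found in the earlier article.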

Ok - I will not bother you any more with this. You should check out the original article about OLE and also look into the specification of the binary formats for Microsoft Office 95 - Office 2007, available from Microsoft. It is actually quite interesting. Just remember that OLE-technology and Compound Files are not the same thing.

And now for something completely different (kind of)

In the lab-tests I have been part of for the Danish Government (National IT and Telecom Agency) we have tested OLE-interoperability. It is important since it is quite normal to embed e.g. a spreadsheet file in a text-processing file. So it is important that the contents of the file are actually usable when receiving it and opening it using another application or on another platform. In this setup we only tested Compound File interop and not interop between OOXML and ODF.

What we did was this:

We created an ODF-file using OpenOffice (on the Windows-platform) in which we embedded an Excel-spreadsheet (binary .XLS-file)

We sent this file to a number of different platforms and applications

  • Windows XP using OpenOffice.org 2.3 DA
  • Windows XP using OpenOffice Novell Edition
  • Linux using OpenOffice Novell Edition
  • Linux (SLED) using IBM Lotus Notes 8

We tried to open the file and documented what happened.

# | Setup | What happened?
1 | Windows XP using OpenOffice.org 2.3 DA | OpenOffice.org opened the document and correctly displayed the contents of the spreadsheet. It was possible to edit the spreadsheet and save it back into the ODF-container
2 | Windows XP using OpenOffice Novell Edition | OpenOffice Novell Edition opened the document and correctly displayed the contents of the spreadsheet. It was possible to activate the spreadsheet but only in "read-only"-mode
3 | Linux using OpenOffice Novell Edition | OpenOffice Novell Edition opened the document and correctly displayed the contents of the spreadsheet. It was possible to activate the spreadsheet but only in "read-only"-mode
4 | Linux (SLED) using IBM Lotus Notes 8 | Lotus Notes 8 opened the document and correctly displayed the contents of the spreadsheet. When activating the spreadsheet the user was prompted to convert the spreadsheet. When accepting this it became editable and when saving it back into the ODF-container, the spreadsheet was persisted as an Open Document Spreadsheet.


So what we saw was basically 3 different approaches to handling the embedded object. In general the Excel-object (Compound File) itself was not a problem - regardless of application and platform. All combinations had no problems with opening the file and displaying the contents - even on platforms without OLE-technology present. The difference was in the applications and their handling of the object. OpenOffice.org presented the approach that most people would expect: it allowed editing the embedded object and saving it back into the container. OpenOffice Novell Edition allowed activating the embedded object but not saving it back into the container, and Lotus Notes 8 took the approach of converting the Excel-object to an Open Document Spreadsheet.

A conclusion?

Well, we took great care not to conclude much - that was not for us to do, we merely provided the technical background for post-lab conclusions. However - the pattern emerging from the description above was similar to a pattern we saw a lot. The problems were not in incompatibility between the formats but instead in how the applications and converters dealt with the formats. We also saw no indications that any of the formats were tied to a specific platform. There were no platform-related problems with round-tripping - or to put it more clearly: the problems we saw when round-tripping documents were not caused by incompatibilities between the platforms (e.g. Linux and Windows) but by different behaviour in the applications implemented on either platform.

So is this good or bad news? Well, as always, truth lies in the eyes of the beholder ... but I think it is good news. 

Where did my line go?

When we started doing our tests in the lab and started thinking about what we thought we would be seeing, we had a very clear understanding that it would not all be blue-sky conversions and that we would identify problems - some more severe than others. We were also pretty aware, that there would be areas, where conversion was just not possible.

But - I am pretty sure I speak for the rest of the group - we were quite surprised to see which areas this concerned.

One area where absolutely nothing could be converted was ... lines. Not only line art, not only complex line drawings ... but simply - lines.

Lines are done in OOXML as either VML or DrawingML, and in ODF they are done using an SVG-derivative. The puzzling thing is that this area is apparently simply left out of all of the converters. We made some simple documents (line.docx 10,47 kb) and (line.odt 6,60 kb) [I have re-made these for this article]. When converting these files using CleverAge 1.0 on Microsoft Office 2003 and 2007, Novell OOXML Translator (on Windows and SLED) or IBM Lotus Notes 8 (on SLED), the lines are simply removed. They are not altered, they are not just hidden, they are not moved to a different location in the document ... they are just removed.

This is another example of the overall observation from our tests ... the quality of the converters is simply not good enough today. If you look at the XML in either of the files above, you will see that even though they look different, they basically specify the same thing (start and end-point for the line drawn), so technically it should pose no problem to do a better conversion.
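To make the point a bit more concrete, here is a minimal sketch in Python. The two XML fragments are hand-written and heavily simplified for illustration - they are not taken from line.docx or line.odt, and real documents carry units, styles and anchoring information that I ignore here. The function names are my own. The sketch only shows that both fragments boil down to the same start and end-point.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified fragments (assumption: illustration only,
# not the actual content of line.docx / line.odt). Units are ignored.
VML_LINE = (
    '<v:line xmlns:v="urn:schemas-microsoft-com:vml" '
    'from="0,0" to="100,50"/>'
)
ODF_LINE = (
    '<draw:line '
    'xmlns:draw="urn:oasis:names:tc:opendocument:xmlns:drawing:1.0" '
    'xmlns:svg="urn:oasis:names:tc:opendocument:xmlns:svg-compatible:1.0" '
    'svg:x1="0" svg:y1="0" svg:x2="100" svg:y2="50"/>'
)

SVG_NS = "urn:oasis:names:tc:opendocument:xmlns:svg-compatible:1.0"


def endpoints_from_vml(xml_text):
    """Read the start and end-point from a VML-style line element."""
    element = ET.fromstring(xml_text)
    start = tuple(float(v) for v in element.get("from").split(","))
    end = tuple(float(v) for v in element.get("to").split(","))
    return start, end


def endpoints_from_odf(xml_text):
    """Read the start and end-point from an ODF-style line element."""
    element = ET.fromstring(xml_text)

    def coord(name):
        # Namespaced attributes are addressed as "{namespace}local-name".
        return float(element.get("{%s}%s" % (SVG_NS, name)))

    return (coord("x1"), coord("y1")), (coord("x2"), coord("y2"))


if __name__ == "__main__":
    print(endpoints_from_vml(VML_LINE))
    print(endpoints_from_odf(ODF_LINE))
```

Once both sides are reduced to two points, mapping one to the other is not the hard part - which is why the missing lines look like a converter problem rather than a format problem.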

It is often said that the main problem with converting from ODF to OOXML (and vice versa) is incompatibilities between the formats. This example is at first glance supporting this argument, but if you dig a bit deeper into the technicality of it, it simply boils down to a problem with bad converters.

Conclusion: The world is seldom black/white ... even if people are trying to convince you so. More often, the world is grey and depressing as a rainy day. 

What is a conversion, really?

I have been part of some work for the Danish National Telecom and IT Agency (IT- og Telestyrelsen). They have coordinated quite a few projects around the country to evaluate the usage of ODF and OOXML and possible problems with co-existence of the two document standards. The website for this work is at http://dokumentformater.oio.dk .

The basic setup for the projects and tests has been:

How does a particular department handle the two document formats and possible conversion between them?

Which problems will arise given their current software install-base?

Is it possible to provide some guidance to the departments regarding which specific features of a document format to avoid since they cause problems?

In other words it has been a rather pragmatic approach based on trying to answer the question: "Why do you experience the problems you see?"

Observations

The first thing we realized during the very first day was something quite crucial:

We were not testing compatibility between two formats - instead we were testing quality of converter-tools and compatibility between the specific format and the internal object model the format is loaded into.

Converter-tools

Both OOXML and ODF are rather immature document formats in the market today since neither of them has a broad market penetration as such. Despite the document count on Google, ODF is not widely used and most people still save their work in .DOC-files - even though they have Microsoft Office 2007 installed. This means that conversion between them is also rather immature and this affects the quality of the converters and the results of converting between one format and another. The ODF-Converter project has an extensive list of the differences between the formats themselves and also a list of features currently not supported by the converter, and similar lists exist of features not supported by the other tools used. Luckily it seems that the quality of the converters is drastically improving with each incremental new release.

We also noted that a converter is not "just a converter". It lives and breathes in the application in which it is installed. This was of particular interest when looking at the ODF-Converter Office Add-In and the SUN OOXML-converter. They are both add-ons to existing Office applications but the application behaviour we saw was in principle the same when using OpenOffice.org, IBM Lotus Notes 8 or OpenOffice Novell Edition.

The problem lies in the fact that every application has an internal object model that determines how a document is held in memory in the application. The binary format for Microsoft Office files was essentially a binary dump of the current memory in the application, and this basically goes for a lot of applications with binary file formats. Anyway - regardless of how a document is "converted" or "transformed" using another application than the originator, at the end of the day it has to be loaded into the internal object model of the receiving application. This essentially means that unless there is a 100% air-tight 1-1-mapping between the document format and the internal object model ... information will be lost. This was one part of the problem - the other was the sequence of conversion. Take a look at the sequence listed here:

Sequence 01                                   Sequence 02

Load original format                          Load original format
 ↓                                             ↓
Convert format to new format                  Load original format into internal object model
 ↓                                             ↓
Load new format into internal object model    (make changes)
 ↓                                             ↓
(make changes)                                Persist as new document format
 ↓
Persist as new document format

It is not entirely evident that the two sequences will produce the same output, and we have seen no evidence that any of the applications tested did actually have a 1-1 mapping between (any) document format and their internal object model. This also goes for Microsoft Office and its corresponding file types, and for OpenOffice itself. In short, this was a fact that we had to deal with in our tests.
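A toy sketch in Python may illustrate why the two sequences above need not give the same result. Everything in it is an assumption made up for the illustration: a "document" is just a dictionary of features, and the converter and the internal object model are each reduced to the set of features they understand. No real application works this simply.

```python
# Toy model (assumption: purely illustrative, no real application or
# converter is modelled here). A document is a dict of feature -> value.

def convert(document, converter_features):
    """Stand-alone converter: only features it knows about survive."""
    return {k: v for k, v in document.items() if k in converter_features}


def load_into_model(document, model_features):
    """Internal object model: only features it can represent survive."""
    return {k: v for k, v in document.items() if k in model_features}


def sequence_01(document, converter_features, model_features):
    """Convert first, then load the converted file and persist it."""
    converted = convert(document, converter_features)
    model = load_into_model(converted, model_features)
    return dict(model)  # persist as new document format


def sequence_02(document, model_features):
    """Load the original format directly, then persist as the new format."""
    model = load_into_model(document, model_features)
    return dict(model)  # persist as new document format


original = {"text": "Hello", "table": "2x2", "line": "(0,0)-(100,50)"}
CONVERTER_FEATURES = {"text", "table"}  # hypothetical: converter drops lines
MODEL_FEATURES = {"text", "line"}       # hypothetical: model drops tables

if __name__ == "__main__":
    print(sequence_01(original, CONVERTER_FEATURES, MODEL_FEATURES))
    print(sequence_02(original, MODEL_FEATURES))
```

In this made-up case Sequence 01 loses both the table and the line, while Sequence 02 only loses the table. The losses simply stack up for every lossy step a document has to pass through.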

On a funny note:

The conversion tools we used were all based on XSLT-transformation between the document formats. They are both XML-formats, so it is a good choice. However, we heard rumours that Novell would dump their OOXML-converter (based on XSLT) and develop their own converter based on the internal object model. It will be interesting to see, if it brings greater quality to the converters.

On a lighter note:

We saw in our tests that using the binary Microsoft Office file format as a middle-man when converting from OOXML to ODF (and back) actually produced the best results ... by a long shot. Having this step and using the binary Office file format as a type of "Lingua Franca" was more or less the key to "flaw-less conversion". If you stop and think about it, it makes perfect sense why we saw this. The Microsoft Office Binary file format is well established in the market (not thanks to Microsoft, but to reverse engineering) and the format has been around for a long time. Basically, all applications can read it and all applications can write it. But why is this interesting? Well, OOXML is an XML-version of the binary Office file format, so since there are "no problems" with converting from the binary format to ODF, it should be technically relatively easy to convert from OOXML to ODF, since OOXML is an XML version of the binary file format.

It is just a matter of time ... and continuous improvement of the format converters.