a 'mooh' point

clearly an IBM drone

OpenXml SDK released as OSS

Yesterday I was notified that the OpenXml SDK had been released as an Open Source Project by Microsoft.

Back in summer 2008 I attended a workshop in Redmond, WA regarding the future support of OOXML and ODF in Microsoft Office. I remember sitting down with one of the PMs at the time – talking to him about what a wonderful idea it would be to release the OpenXml SDK as OSS. I also remember how frustrating it was to be told – amidst between the lines - that “it ain’t gonna happen”.

Now – almost exactly 6 years later, they have finally listened (btw, I am in no way trying to take credit for “making” Microsoft OSS the OpenXml SDK – it was completely their decision to do it). But it does seem to confirm a trend in Microsoft – where the revenue cows (Office, Windows, and Servers etc.) are kept closed, but the tooling around them, the stuff that ties them all together – is with increasing frequency being released as open source.

The OpenXml SDK is released under the auspice of “MS Open Tech” – in other words; Doug Mahugh and friends. Eric White has been an integral part of making this happen. Kudos to all of them from here :-).

The source code is available on github and is free for everyone to look at and download. The license is Apache 2.0 . It will still remain to be seen if they request pull-requests, but I cannot imagine why they should not.

Now, I haven’t had the time to dig into the code in much detail yet, but I will do this in the following weeks. One thing I will look deeply into is the .Validate()-method of the toolkit. It validates the content of the OOXML-document being worked at – oh well, it should do, but if anyone has tried to run a document through e.g. my validator on http://29500.idippedut.dk or Alex Brown’s at https://code.google.com/p/officeotron/ will have found out, that the document – even with a “clean” result from .Validate() is not valid according to the schemas of OOXML. It turns out, that it does not validate against the spec – it validates against the supported functionality of Microsoft Office. Now, that is a completely valid (no pun intended) approach from the SDK, since most working with OOXML at the end of the say need interoperability with Microsoft Office.

But now with the SDK being released to a larger amount of developers, I guess it would be appropriate to expand or “fix” the validation-method. One possible improvement could be to allow validation against a range of XML schemas. Another would be to allow validation after haven processed the document applying MCE to the content. A Third improvement would be to write out dependency of WindowsBase.dll ( and thereby System.IO.Packaging) . I have a theory that the reason why OpenXml SDK is not available on Windows Phone is this exact dll, and it would be nice to be able to manipulate OOXML-documents in memery on WP.

We’ll see what will happen to it in the future – what would you like to have changed in the SDK?

ODF 1.2 in ISO (PAS-submission)

The other day a document landed on my "desk" in the Danish Mirror committee to JTC1 SC34. It was the document ”EXPLANATORY REPORT - OASIS Submission of OpenDocument v1.2 to ISO/IEC JTC 1 [JTC1 N12033]”. In other words: ODF TC in OASIS has wished to elevate OASIS ODF 1.2 to an ISO-standard.

OASIS is a so-called "PAS Submitter" to ISO, which enables more or less direct elevation of existing standards to be an ISO-standard, but without any corresponding work on the standard itself in ISO. OASIS ODF TC has used this process for ODF 1.0 back in 2006.

So ODF will not be maintained in ISO - and agreement has been made that work itself developing and improving ODF will be exclusively done in OASIS ODF TC - with subsequent releases to ISO for approval ... what someone might refer to as "rubber-stamping".

For OOXML a different agreement was made during sumission/approval of OOXML - that being that work with OOXML takes place in ISO and this is where the standard is developed and improved. That being said, a substantial amount of work with OOXML takes place in ECMA and a large amount of our work in ISO originates from ECMA's TC45 - the group where OOXML was "born". ODF 1.2 was approved in OASIS in september 2011 (and publicized in early 2012), and now three years later it has landed on our desk in ISO.

I immediately wrote to Danish Standards and told them, that I suggest that Denmark votes "yes". Technically this is not a task for the mirror committee to SC34 since the vote is on JTC1-level, but the more I think about it, the more I doubt that I made the right suggestion to Danish Standards.

Because does it add any value for anybody to - three years after approval in OASIS - ask for an ISO approval? Is there any good reason to spend time on this in OASIS and in ISO? If the argument is that there are some (governmental) institutions that require an ISO-standard level to use it in their organisations/countries - what good is it to them to wait more than three years to submit it to ISO for approval?

Moving towards OOXML(S) (update)

Some time ago I wrote about some of the enhancements of Microsoft Office in terms of how far they have made it in implementing the content of the conformance profile "Strict" or "<S>". As you might recall, I made a run-through of a list of feature areas and marked each with either a green, yellow or red traffic light. There were no red traffic lights, but some areas had a yellow marking. These were

  • "ink"
  • "legacy diagrams"
  • "groups"
  • "form controls"
  • "activeX objects".

These document types previously used VML as containing frame etc, but Microsoft Office 2010 was now supposedly using DrawingML for these. The reason for them being yellow and not red was that I did simply not know how to test these things - either because of poor Microsoft Office skills or lack of proper hardware ("ink" is used on tablet PC's and I don't have one of those at hand).

Stockholm plug-fest

When WG4 met in Stockholm a couple of months ago, I got a chance to take a look at the documents I couldn't create myself. The cool thing about participating in these meetings is that there is an abundance of different hardware and software on the laptops of the delegates, so after one of the sessions a few of us had our own little "Microsoft Office OOXML <S> interop plug-fest" and I finally had a chance to get my hands on those files.

I could have simply updated the previous article with the new information, but a couple of interesting thing emerged that made me write up a new piece.

First, the results are this:

File typeFeatureComment 
DOCX Ink Drawings Previously used VML, now uses DrawingML
Success, green traffic light
XLSX Ink Drawings Previously used VML, now uses DrawingML Success, green traffic light
PPTX Ink Drawings Previously used VML, now uses DrawingML Success, green traffic light
DOCX Legacy Diagrams Previously used VML, now uses DrawingML Success, green traffic light
XLSX Legacy Diagrams Previously used VML, now uses DrawingML Success, green traffic light
PPTX Legacy Diagrams Previously used VML, now uses DrawingML Success, green traffic light
DOCX Drawing Shapes Previously used VML, now uses DrawingML Success, green traffic light
DOCX Textboxes Previously used VML, now uses DrawingML Success, green traffic light
DOCX WordArt Previously used VML, now uses DrawingML Success, green traffic light
DOCX Groups Previously used VML, now uses DrawingML Success, green traffic light
XLSX Form Controls Previously used VML, now uses DrawingML - except on "chart sheets" Success, green traffic light
XLSX ActiveX Objects Previously used VML, now uses DrawingML Success, green traffic light
PPTX ActiveX Objects Previously used VML, now uses DrawingML Success, green traffic light
XLSX OLE Objects Previously used VML, now uses DrawingML Success, green traffic light
DOCX ST_OnOff Uses the new ISO-approved simple type without the values "on" and "off" Success, green traffic light
XLSX ST_OnOff Uses the new ISO-approved simple type without the values "on" and "off" Success, green traffic light
PPTX ST_OnOff Uses the new ISO-approved simple type without the values "on" and "off" Success, green traffic light
XLSX ISO-dates Can persist dates in ISO-8601 format and avoids the "evil" serial dates. Failure, red traffic light

The trained eye will notice that all the yellow lights have been replaced by green lights - In other words, the list above clearly shows that even though Microsoft Office 2010 does not write <S>, the developers in Redmond have clearly made some significant progress.

There are a couple of interesting points about the technicalities of the files I looked at.

Predicting the future is difficult

The files containing "legacy diagrams" stand out, because of the way Microsoft Office 2010 breaks compatibility with e.g. Microsoft Office 2003 and earlier versions. The thing is - when loading a PPT-file with a legacy diagram from e.g. Microsoft Office 2003 the diagram will be in VML-format. When it is loaded in Microsoft Office 2010, modified and saved again - it won't save the diagram in VML. It just won't. The diagram will be saved all right - but now using DrawingML instead of VML. So this is essentially a case where interoperability with this "legacy" application is hurt since Microsoft Office 2003 has no idea what to do with the DrawingML it loads.

MCE to the rescue

For all the other files, MCE once again steps up.

If we look at the file containing the ink notations, (some of) the markup will look like this:

[code:xml]<mc:AlternateContent>
  <mc:Choice Requires="wpi">
    <w:drawing>
      <wp:anchor>
        <a:graphic xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">
        <a:graphicData uri="http://schemas.microsoft.com/office/word/2010/wordprocessingInk">
      </wp:anchor>
    </w:drawing>
  </mc:Choice>
  <mc:Fallback>
    <w:pict>
      <v:shapetype >
        <v:stroke joinstyle="miter" />
        <v:formulas>
          <v:f eqn="if lineDrawn pixelLineWidth 0" />
        </v:formulas>
      </v:shapetype>
      <v:shape id="Ink 15">
        <v:imagedata r:id="rId6" o:title="" />
      </v:shape>
    </w:pict>
  </mc:Fallback>
</mc:AlternateContent>[/code]

(I have modified and trimmed the real XML for easier reading - especially the VML was really, really ugly)

So with this approach you can actually have the best of two worlds - the new and the old without losing information. That is - if you know MCE, of course. This again shows what a great tool alternating content blocks (ACB) of MCE are for this task. It allows you to innovate while still making it possible to ensure some sort of compatibility with earlier programs that did not know of the new technology.

And what about them dates?

The even more trained eye will have noticed that the green traffic light for ISO-dates in SpreadsheetML has been downgraded to a flashing red traffic sign.

The first test I did with Microsoft Office 2010 CTP1 had a small check-box in the "backstage" area that would allow dates in spreadsheets to be persisted in ISO-8601 format. With RTM of Microsoft Office 2010 this check-box is gone so we are now back to using serial dates again.

It would be easy to hit on Microsoft for removing this check-box, and I am sure that many will. But the truth is that the removal of this feature is due to activities in WG4 where we maintain OOXML.

As some of you may recall, the introduction of ISO-dates in OOXML was done in Geneva at the BRM in those hectic days we spent there. The trouble with introducing the ISO-dates in OOXML was that it looked really, really good on paper - but it sucked in real life.

The reason is that we "forgot" to change namespace name for ISO OOXML-files so documents conforming to ISO OOXML<T> share namespace name with documents conforming to ECMA 376. This has has enormous consequences for spreadsheets and those applications designed to support ECMA-376 but not necessarily ISO OOXML. At the second F2F of WG4 in Prague in 2009, we had a demonstration of how bad is was - not a single application would interpret these new dates correctly and - what was perhaps even worse - they did not display any warnings to the user.

We have been discussing this a lot in WG4, and in the end we decided to start the work to remove usage of ISO-dates from Part 4. This correction of a BRM decision was not easy to agree on (AFAIR it has not been finally approved as of yet) and the removal of the "Save-as ISO-dates" feature in Microsoft Office 2010 is propably in anticipation of this pending removal of ISO-dates from <T>. I think it might be important to note that this removal was not due to "pleasing Microsoft". In fact - they had already implemented support for this in CTP1. We are removing ISO-dates from Part 4 due to problems with everybody else.

I always like to give credit where credit is due, and I think this is one of those cases. Microsoft has clearly worked with - and listened to - the standardisation community and has chosen to remove a feature they had already implemented.

So what's next?

Well, Doug Mahugh recently wrote about the approach of Microsoft when dealing with OOXML<S>. Amongst other things he wrote that Microsoft Office 2010 will have read-support for OOXML<S> but that "a small number of optional features" will still be lost (that's just new-speak meaning "we haven't implemented support for all of Part 1"). I asked him what that list consisted of, and Doug said they'd provide anwsers when they have them. I hope that list will come soon.

As I have mentioned earlier, CIBER Denmark A/S (the company I work for) is not in the "productivity-suite-business" - but we develop solutions that work with these suites be that Microsoft Office, OpenOffice.org, iWork or others. Having read-support for OOXML<S> in Microsoft Office 2010 helps us a great deal, because we can now start trimming our code to target OOXML<S> instead of OOXML<T>. We think that adds great value to us and our customers. But we need a definitive list of the areas where we can expect Microsoft Office 2010 to ignore our markup. If we can't have that we are forced to go the safe-route and keep producing OOXML<T>-files and we'd hate to do that. But without a list from Microsoft, we feel that our hands are tied behind our backs.

So please, Microsoft - give us the list ASAP. Otherwise the uncertainty of what Microsoft Office will ignore is to great a risk for us to start producing strict files and your read-support for OOXML<S> is more or less useless to us.

Correct according to spec or implementation?

In the recent SC34 WG4-meeting in Stockholm, validators quickly became the talk of the town - so to speak. As I am sure you all know, Alex Brown made the office-o-tron some time ago - a validator targeting both ODF and OOXML in their ISO-editions. A few weeks ago I myself made a validator - but mine only targets OOXML in its "latest-improved-and-approved-transitional-version". Alex Brown's is written in Java and mine is written in C# .

Anyways - both Alex and I had some lengthy discussions with Microsoft about our validators and the errors they report. The thing is - there is a bug in the OOXML-specification dealing with how to specify relationship type for the part containing document properties like "author", "created on", "last modified on" etc. This part is a "central" part in OOXML, and to the best of my knowledge, there is not a single implementation out there that doesn't use this part for storing these so-called "core properties".

If you have tried to validate an OOXML-file in my validator, you'd probably have encountered this error:

Checking relationshiptype http://schemas.openxmlformats.org/package/2006/relationships/metadata/core-properties ...

RelationshipType http://schemas.openxmlformats.org/package/2006/relationships/metadata/core-properties is not valid. It should have been http://schemas.openxmlformats.org/officedocument/2006/relationships/metadata/core-properties.

In OOXML the "glue" tying the document and its various parts together is "relationship types". So for a given media-type (content type), a relationship type has to be used to properly register it in the package. A few relationship types are defined for the common parts of OOXML documents, i.e. for wordprocessing files, for spreadsheets, for presentations, for headers, footers etc. Some of these are defined in Part 1, section 15 and this is where the bug is. It is obviously a typo, and it has already been included in our list of fixes for the next batch.

The trick is - this has rather drastic consequences - at least from a validation point of view. Because a typo in this area will affect almost every implementation of OOXML that persists these basic data-chunks.

The thing is ... each and every document created by Microsoft Office will likely fail due to this bug in the specification.

So what are you gonna do?

Well, we discussed several different approaches.

One was simply to correct my validator to not report this error. I don't really like this idea, since it opens a flood gate of other scenarios where small corrections should take place Also, if I did want to go down that road, it should require a strategy for handling these things since I wouldn't want to correct any one error based on what Microsoft Office does - being an IBM drone and all. As of yet, I haven't been able to come up with such a strategy.

A second was to report warnings instead of errors in areas where "known bugs" were already in our to-do list of future corrections. I am not sure I like this either since it makes the validator almost impossible to maintain and it muddens the results since no-one will be able to figure out if a warning was simply a warning or a "down-graded error".

A third option is to do nothing.

I like that.

If you have tried to validate the same document using my validator and Alex's you'd probably have noticed that Alex's validator emits many more errors than mine. This is due to the fact that I use the schemas with the first batch of corrections (the so-called COR1-set). I'll update the schemas whenever the next batch of corrections are approved by either SC34 or JTC1. Alex's validator uses the schemas that was in the originally approved version of ISO/IEC 29500:2008. So my validator is already pretty "graceful" as it is.

Aonther reason that I like the idea of "doing nothing" is that it emphasizes a crucial point: A document should be valid according to the spec and not according to whatever implementation one considers "reference". There are other standards out there where we have a strange mixture of behaviour defined in the specification and behaviour buried in a "reference implementation". I don't know about you - but I'd rather have the spec be "the truth" than several gigs of source-code from whatever implementation is the pet app-de-jour at the moment.

Additionally, this shows us that all the implementations that handle this have failed in terms of feeding their experiences back to the standardisation organisation maintaining the specification. They will all have encountered this issue - but failed to report it ... unless, of course

  • they haven't looked in the spec at all [0]
  • they haven't bothered to validate their documents

The puzzling thing is - Alex and Gareth discovered this bug in January 2010 and his validator has been reporting this error for months now. I guess the answer to why neither of the implementers of OOXML has reported this bug is ... blowing in the wind.

So what I am trying to say is this: My validator stays the way it is - validating documents according to the spec. If any vendor discover a problem that is clearly an error in the spec, they should prioritize notifying us about it so we can correct it (which we will).

 

 

[0] Truth be told, prioritizing "make it work with most important implementation" is not the un-heard of. I myself, when I created my first ODF-files, didn't look in the ODF-spec. I reverse-engineered ODF-documents created by OOo since I only cared about whether OOo would eat it or not. Other implementations insist on not "supporting OOXML" but "supporting Microsoft Office output".

Final nail in the coffin for the "highlander approach"

On January 29th the Danish politicians finally got their acts together and did something about open document formats. After almost 3 years of debate and endless dragging of their feet - a consensus and agreement was finally made on that Friday.

The agreement made had this content:

For use in the public sector in Denmark, a document format can be used, if it is on the list of approved document formats. To get on the list, the document format must comply with these rules:

  • It has to be completely documented and be publicly available
  • It has to be implementable by anyone without financial-, political- and legal limitations on neither implementation nor utilization
  • It has to have been approved by an internationally recognised standardisation organisation, such as ISO, and standardised and maintained in an open forum with an transparent process.
  • It must be demonstrated that the standard can be implemented by everyone directly in its entirety on several platforms.

If you ask me, this list is pure rubbish. Apparently a deal was made on that Friday morning literally minutes before an open hearing in Parliament and this bears all signs of a job done in too much haste.

(of course, all this happened when I was away on family weekend, but as you can imagine the blog-sphere went crazy and twitter buzzed like a hive of bees with the gent's of "big blue" and "big red" taking swings at each other)

Devil in the details

The problem is that it is written all over it that this list will be taken very literally and we are going to continue to have to discuss stupid details with stupid people - instead of getting to work to start giving value to our customers.

The problems pertain to item 1) and item 4).

Item 1 is actually not that big of a deal, but it is an example of a requirement that cannot be verified. Because what does "completely documented mean"? Does it mean that a mere list of all elements and attributes is enough to give a "thumbs-up"? Does it mean that a single ocurrence of the phrase "... is application defined" provides automatic rejection? Now, I agree with the idea behind this, coz' shit has to be documented but this item should be removed or altered to provide real meaning.

And what about item 4) ?

Well the problem with this is that no document standard of today can be said to comply with this requirement - thereby making the list Ø. The only way a document format can be said to comply with this would be to have 2 independant applications, each claiming to be implementing the specification in "its entirety". And still we wouldn't be able to actually prove it. We would, at best, be able to show that with high likelyhood the applications do actually implement the specification "in its entirety".

Two to go ...

So that basically leaves us with two requirements. The only requirement we should think of adding would be "It has to be relevant in the market" ... ODA, anyone?

The silver lining

But do not fret - it is not all bad. No, because the agreement effectively puts the final nail in the coffin for the "there can be only one document format"-line of thought. The Danish parliament has has turned its back on any exclusivity with regards to document formats and has turned its focus to "open standards". This is no doubt a positive move, because now it doesn't make any sense any more to argue "which one is bigger (or, smaller)" or over who got to the playground first.

With this decision Denmark follows other countries like Norway, Belgium and Holland where the notion of "open standards" is also the center of thought - and who have also discarded the idea of "value can only come if we only have one document format". This is fantastic - and I applaud our politicians on making this decision - even though some of the details lacked consideration.

Smile

Moving towards OOXML(S)

Some time ago I wrote a bit about what Microsoft Office had managed to get into Microsoft Office 2010 CTP1 (or, I wrote about the stuff I had tested). As you might recall, the results were rather slim, so I wrote to Microsoft to hear, if that was really it. It has been the fear of many that Microsoft will never, ever care at all about the strict conformance clause of ISO/IEC 29500, and my tests clearly was a sign that they were right. Heck, some even mentioned that "the only choice for Microsoft is to avoid adding new BRM features in their OOXML files".

On the other hand I have always regarded big companies like Novell, IBM, ORACLE etc as rather simplistic in their development cycles - that they'll always choose the path of least resistance. Microsoft is in no different here, and moving towards strict and side-tracking nasty. legacy stuff like VML etc is clearly an attempt to make the developmentpath easier in the future.

The list I got was this:

File typeFeatureComment 
DOCX Ink Drawings Previously used VML, now uses DrawingML
Advarsel, gult trafiklys
XLSX Ink Drawings Previously used VML, now uses DrawingML Advarsel, gult trafiklys
PPTX Ink Drawings Previously used VML, now uses DrawingML Advarsel, gult trafiklys
DOCX Legacy Diagrams Previously used VML, now uses DrawingML Advarsel, gult trafiklys
XLSX Legacy Diagrams Previously used VML, now uses DrawingML Advarsel, gult trafiklys
PPTX Legacy Diagrams Previously used VML, now uses DrawingML Advarsel, gult trafiklys
DOCX Drawing Shapes Previously used VML, now uses DrawingML Succes, grønt trafiklys
DOCX Textboxes Previously used VML, now uses DrawingML Succes, grønt trafiklys
DOCX WordArt Previously used VML, now uses DrawingML Succes, grønt trafiklys
DOCX Groups Previously used VML, now uses DrawingML Advarsel, gult trafiklys
XLSX Form Controls Previously used VML, now uses DrawingML - except on "chart sheets" Advarsel, gult trafiklys
XLSX ActiveX Objects Previously used VML, now uses DrawingML Advarsel, gult trafiklys
PPTX ActiveX Objects Previously used VML, now uses DrawingML Advarsel, gult trafiklys
XLSX OLE Objects Previously used VML, now uses DrawingML Succes, grønt trafiklys
DOCX ST_OnOff Uses the new ISO-approved simple type without the values "on" and "off" Succes, grønt trafiklys
XLSX ST_OnOff Uses the new ISO-approved simple type without the values "on" and "off" Succes, grønt trafiklys
PPTX ST_OnOff Uses the new ISO-approved simple type without the values "on" and "off" Succes, grønt trafiklys
XLSX ISO-dates Can persist dates in ISO-8601 format and avoids the "evil" serial dates. Succes, grønt trafiklys

(The last four was addeed by me and didn't appear on the list from Microsoft)

Now, "someone" once wrote to me that you shouldn't make any decisions based on what Microsoft Office says they will do - you should wait until they actually do act. I couldn't agree more, so I tried to test the list I received. I have tested the lines marked with a green traffic light by first creating a document in Microsoft Office 2007 to verify the usage of e.g. VML and then I created the same document in Microsoft Office 2010 Beta [']. The lines marked with yellow traffic lights have not been tested by me, since I frankly don't have the Office-skills to create a file (what the hell is an "Ink Drawing, btw?). If anyone can test this, I'd be happy to update the list. Also, regarding the lines about ST_OnOff, I have tried to create files that would contain the "bad" On/Off-values, but I haven't succeeded in this. That is not the same as deterministically verifying that it cannot be done, so again - if you can create a file in Microsoft Office 2010 with the bad values, send it to me and I'll update the list.

So getting back to "don't trust Microsoft as far as you can throw them", this is in no way a definitive list. The list is based on Microsoft Office 2010 Beta, and much can happen until final RTM - both in the right direction with even more things being fixed, but also in the wrong direction with things being pulled off the list again (WinFS, anyone?). But for those of us not implementing complete Office suites but "merely" interacting with the ecosystem by generating files, this is undoubtly good news. Add to this that Microsoft confirmed a few TC-calls ago in WG4, that pending the current AMD1-ballot, Microsoft would add the new namespaces of strict files to the white-list of known namespaces in Office 2010. This effectively means that Microsoft Office will be able to load (some) strict files, and if you just happen to generate PPTX-files with embedded objects, you'll likely never again have to generate markup like this:

[code:xml]<w:object w:dxaOrig="15" w:dyaOrig="15">
    <v:shapetype
        id="_x0000_t75"
        coordsize="21600,21600"
        o:spt="75"
        o:preferrelative="t"
        path="m@4@5l@4@11@9@11@9@5xe"
        filled="f"
        stroked="f">
        <v:stroke joinstyle="miter"/>
            <v:formulas>
                <v:f eqn="if lineDrawn pixelLineWidth 0"/>
                <v:f eqn="sum @0 1 0"/>
                <v:f eqn="sum 0 0 @1"/>
                <v:f eqn="prod @2 1 2"/>
                <v:f eqn="prod @3 21600 pixelWidth"/>
                <v:f eqn="prod @3 21600 pixelHeight"/>
                <v:f eqn="sum @0 0 1"/>
                <v:f eqn="prod @6 1 2"/>
                <v:f eqn="prod @7 21600 pixelWidth"/>
                <v:f eqn="sum @8 21600 0"/>
                <v:f eqn="prod @7 21600 pixelHeight"/>
                <v:f eqn="sum @10 21600 0"/>
            </v:formulas>
            <v:path
                o:extrusionok="f"
                gradientshapeok="t"
                o:connecttype="rect"/>
        <o:lock v:ext="edit" aspectratio="t"/>
    </v:shapetype>
    <v:shape
        id="_x0000_i1025"
        type="#_x0000_t75"
        style="width:.75pt;height:.75pt"
        o:ole="">
        <v:imagedata r:id="rId4" o:title=""/>
    </v:shape>
    <o:OLEObject
        Type="Embed"
        ProgID="opendocument.WriterDocument.1"
        ShapeID="_x0000_i1025"
        DrawAspect="Content"
        ObjectID="_1327745060"
        r:id="rId5"/>
</w:object>[/code]

['] I have tried, in vain, to get my hands on the latest pre-release edition of Microsoft Office 2010 ... so much for being a drone, when you can't get your hands on the latest bits Frown

 

Microsoft Office 2010 Beta, ODF and leap-year-bug

Some time ago I did some tests of Excel in Microsoft Office 2010 (CTP). The test was around OOXML - but test of ODF-support was missing.

One of the things ODF is missing but is in OOXML is the leap-year-bug ... although most of propably don't miss it all that much. The leap-year-bug is the good ol' Lotus 1-2-3 bug that treated 1900 as a leap year. As a consequence of that, calculations based on dates in the range from January 1st 1900 and February 28th 1900 with dates after this period will be off with one day.

Since Microsoft Office supports (a subset of) ODF, I thought it'd be fun to look at how Excel 2010 handles the leap-year-bug.

The first thing to do is to show how the leap-year-bug is handled by Excel:

So adding a day to February 28th 1900 will result in the non-existing date February 29th 1900, and if you subtract the dates February 27th 1900 and March 2nd 1900 (you'd expect the a value of 3) you actually get a value of 4.

So what will happen if you save this spreadsheet in ODF-format and open it again in Excel? You might expect that - since it was round-tripped through a format not supporting the leap-year-bug, the calculations would now be correct.

... but you'd be wrong. The result is excatly the same:

As I was, you might be wondering how the hell that was possible. But a simple inspection of the markup generated by Microsoft Excel 2010 reveals the answer:

[code:xml]<table:table-cell

  office:value-type="date"

  office:date-value="1900-02-29T00:00:00"

  table:formula="msoxl:=A2+1"

  >
  <text:p>29-02-1900</text:p>

  </table:table-cell>[/code]

A quick-and-dirty conclusion to this would be that Microsoft Excel 2010 violates not only ODF but also xsd:datetime, since February 29th is not a valid xsd:datetime. However, an inspection of ODF reveals that this is not the case. Microsoft Office claims conformance to ODF 1.1. and ODF 1.1 states the following about the value-space of the attribute "office:date-value" (Section 16.1 , p 702) :

A dateOrDateTime value is essentially an [xmlschema-2] date and time value with an optional time component. In other words, it may contain either a date, or a date and time value.

So strictly (*giggle*) speaking, Microsoft Office 2010 does not violate ODF 1.1 .

However - specifying an invalid date in an attribute that might contain xsd:dates is not very smart, dear Microsoft. Those of us wanting to use standard libraries to process the content of an ODF-document will likely get unpredictable results when trying to parse this invalid date. Heck, even .Net's DateTime.Parse()-method throws an exception when trying to parse this value.

Also, ODF TC has tightened up the prose in ODF 1.2 and it is now:

A dateOrDateTime value is either an [xmlschema-2] date value or an [xmlschema-2] dateTime value.

So Microsoft Office 2010 might not violate it now - but it will when ODF 1.2 comes out.

Extending ODF

Microsoft could always opt for extending ODF using the extension mechanism (to add elements and attributes using a foreign namespace). So Microsoft could chose to add their own attribute to the <office:spreadsheet>-element saying something like

[code:xml]<office:spreadsheet mso:EnableLeapYear="true"/>[/code]

The problem with this approach is that is comes into conflict with the new conformance clauses of ODF where a clear distinction between "normal" documents and "extended" documents is made. Procurement-wise it is a big no-no only to support extended documents (look what happened in Denmark!) and Microsoft risks that some government somewhere decides not to use Microsoft Office due to lack of conformance to the "normal" conformance clause of ODF 1.2.

Thus, Microsoft needs to find another solution ...

Configuration to the rescue!

Luckily for Microsoft (and we all know how picky they are wrt "preserving functionality" etc), there is a fully compliant way out of this while still preserving the leap-year-bug in spreadsheets - regardless of persistance format.

As you probably know that so-called config-item-sets are a gold-mine of endless possibilities. Originally (until ODF 1.1) the purpose of these elements and attributes were to store application specific settings, like (and this is a quote from ODF 1.1) "document settings, for example a default printer or view settings, for example zoom level". In ODF 1.2, all bets are off and there are no restrictions to the usage of the elements. The config-item-set elements were never meant to be an extension mechanism (by ODF TC co-chair from Sun/ORACLE - go figure), but OpenOffice.org uses them extensively - in fact, when creating a "blank" text-document, spreadsheet or presentation in OpenOffice.org, a total of 228 (76 for text documents, 66 for spreadsheets and 86 for presentations) settings (of which non are described in ODF) are defined in the the settings.xml-file of the packages. Somehow ODF TC has not found it necessary to include usage of config-item-sets in the "extended conformance clause", so a document can claim 100% conformance to ODF 1.2 "normal documents" while throwing dozens of settings into config-item-set elements. So the solution to claim conformance to ODF while enabling the leap-year-bug is simply

[code:xml]<config:config-item-set  config:name="mso:spreadsheet-settings">
  <config:config-item

    config:name="EnableLeapYearBug"

    config:type="boolean"

  >

    true

  </config:config-item>
</config:config-item-set>[/code]

This should be combined with this markup for the specific cell

[code:xml]<table:table-cell

  office:value-type="float"

  office:-value="60"

  table:formula="msoxl:=A2+1"

  >
  <!--<text:p>29-02-1900</text:p>-->

  </table:table-cell>[/code]

(and you don't really need the bit I have commented out).

I don't know about you, but I find this just darn right fantastic!

Smile

Excel 2010 (Microsoft Office 2010 CTP TO-do list (01)

I have been looking at how Excel 2010 has implemented various features using ISO/IEC 29500-4:2008 - also known as "OOXML Transitional".

Background:

ISO/IEC comes in two variants, a "transitional" (T) and a "strict (S). Transitional is the one containing all the legacy stuff such as VML, legacy digest algorithms, leap year bug etc. S does not contain these things and is therefore considered "more pure". T is practically identical to the document format submitted to ISO/IEC by ECMA - also known as "ECMA-376". A document conforming to ECMA-376 will therefore also conform to T, since the schemas are practically identical.

T is currently a superset of S, which means that "T includes everything in S". This has the effect that within a T document, a vendor can take advantage of new features in S while still being in "the comfort zone of T". T is therefore considered by some as providing a “graceful migration path to S”, meaning that vendors can change their existing T-compliant code, on a case-by-case basis, to gradually support the features of S instead of those of T.

Update 2009-11-17: By reading in the spec today, I realised that I was wrong in saying that the leap-year bug was not in S - indeed it is. I appologize for the confusion. Thank you, Jens Hørlück for pointing this out to me.

As you know, Microsoft is claiming "IS29500 compliance" for their latest incarnation of Microsoft Office. OOXML is a big and complex document format, and even though OOXML was not published until waaaay after "feature freeze" of Office 2010, I thought it'd be interesting to take a look at how Excel has reached compliance.

When looking at the markup generated by Excel 2010, it is important to remember the background information above, since you'd expect to find traces of this "graceful migration" in the markup already.

Conformance class

The DIS-process added a new attribute to the root elements of the documents. This element would specify the conformance level of the document. As to not invalidate existing documents, the default value for this attribute was “transitional”. The other possible value is "strict".

So when creating a new spreadsheet using Excel 2010, which markup does it create?

[code:xml]<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<workbook>
  <sheets>
    <sheet name="Sheet1" sheetId="1" r:id="rId1”>
  </sheets>
</workbook>[/code]

(condensed for easier reading)

The above is the markup for the root element of a SpreadsheetML spreadsheet.

So the conformance class attribute is omitted – making the document a T-document. I know that it is basically nonsense to add an attribute with the default value to implement support for this behavior, but for those of us that like looking at markup in text editors, I’d recommend Microsoft to include the conformanceClass-attribute with the appropriate value. As it is a no-brainer to implement for a consumer – it should be a no-brainer to implement for a producer.

Also – the existence of the conformanceClass-attribute is the only (or, best, anyway) indicator that a T document was created according to the schemas of ISO/IEC 29500 and not those of ECMA-376.

(and yes, I know that they are practically identical, but adding the attribute would be a clear indication that Microsoft wishes to produce markup according to ISO/IEC 29500 and not ECMA-376 1st Ed.)

Microsoft, please fix!

ISO-dates

Support for ISO-dates in SpreadsheetML just might be the biggest issue we faced in the DIS-process. It was hugely controversial and as such, a large amount of time was used at the BRM to figure out a solution.

Excel 2010 allows a user to manually chose the option to persist dates in ISO-8601 format. This is done through the "Settings-dialogue" as

So let us see what Excel 2010 does with dates. I opened the application and in a cell I wrote the date 28-02-1900. I then unzipped the XLSX-file and found this markup in the Worksheet-part:

[code:xml]<sheetData>

  <row r="1" >
    <c r="A1" t="d">
      <v>1900-02-28</v>
    </c>
  </row>
</sheetData>[/code]

So the date is actually persisted in the file using ISO-8601 notation and the cell is correctly typed with a t=”d”-declaration.

Now, depending on your point of view, this is a good thing, because it takes us further from the hell-hole I have previously written about with respect to “date-typing-of-numbers” in ECMA-376.

Thank you, Microsoft – good for you!

Smile

Leap-year bug

So … with the “Save-as-ISO- 8601-dates”-setting to “yes, please” as previously described, I added a formula to the spreadsheet above in cell B1. The formula simply was “=A1+1”.

Anyone up for guessing the result?

Well, the result was this:

That is funny because of two things:

  1. I set the “Save-as-ISO Dates”-setting to “yes”, so I was not expecting the leap-year-bug to still be used.
  2. I am pretty sure that the date-representation “1900-02-29” is not a valid ISO-date.

So I looked at bit at the markup generated for this spreadsheet.

The result was:

[code:xml]

<sheetData>
  <row r="1" >
    <c r="A1" t="d">
      <v>1900-02-28</v>
    </c>
    <c r="B1">

      <f>A1+1</f>
      <v>60</v>
    </c>
  </row>
</sheetData>[/code]

Ahem … dear Microsoft … WTF?

So you have persisted the first date as an ISO-date,

but the result of calculating a value based on it – you persist the result as a serial date? In which universe would this make sense?

(I should note that whenever doing calculations resulting in dates not being invalid, the date is correctly persisted by Excel 2010 as an ISO-8601 date)

But – as I was writing this I thought “wasn’t there something about an at-BRM-added attribute called “dateCompatibility”? This was the attribute used to determine if the date-system used should support the leap-year-bug or not. And truth be told, the attribute is not used – making the dateCompatibility default to “true” – hence honoring the leap-year-bug.

But – this could simply be a glitch of Excel2010, so I added the attribute myself to force Excel2010 to discard date compatibility. I’d then suspect the result of adding 1 to the date of 1900-02-28 would be 1900-03-01.

The markup was this:

[code:xml]<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<workbook dateCompatibility="false">
  <sheets>
    <sheet name="Sheet1" sheetId="1" r:id="rId1”>
  </sheets>
</workbook>[/code]

Result: still 1900-02-29

I this calls for a few conclusive words before I move on:

Microsoft, I think you should stop using ISO-8601 dates in SpreadsheetML in T because of the arguments provided by my fellow WG4-expert Gareth Horton. But if you insist on using ISO-8601-dates in SpreadsheetML-T, you should do it right,. If you ask me, a user manually choosing to have Excel2010 use ISO-8601 dates, is a clear indication that the user does not want to deal with the leap-year-bug. The user has even decided to continue despite your warning that he or she might lose precision. So when a user performs this specific action, set the damn dateCompatibility-attribute to “false” and persist ALL dates as ISO-8601 dates.

You have a golden opportunity here to ditch the leap-year-bug for all new documents. If you act wisely and add the conformance attribute (so consuming applications can distinguish the files from ECMA-files) and setting the dateCompatibility-attribute to “false” when persisting dates as ISO-8601 dates, you’d have done really, really good. You have the by far biggest implementation of OOXML – so the vendors will follow your lead. Any application adding support for ISO-8601 dates in transitional documents in their existing ECMA-376-compliant code, will be able to kno

w what to do - and they'll do it the way you do it.

Now, I know you would like to demonstrate to all of us that you want to go towards “the S way”, but you need to do it right. Mixing ISO-8601 dates with serial dates simply to be able to maintain the leap-year-bug is not the right thing to do.

I have absolutely no idea of how the internals

of either the calculation instructions in Excel works – or even the Excel-team itself, but the way you have “added” ISO-8601 dates indicates to me, that you have changed barely anything except for the persistance-mechanism for dates.

I’m sure you’d argue that there is no longer time to change the inner workings of Excel2010, and you might very well be right about that. If that is indeed the case, I suggest you remove the option of saving dates as ISO-8601-dates from Excel2010 as soon as possible to avoid doing any more harm to the ecosystem around OOXML that we both share deep concerns for.

VML

At the BRM the biggest pile of things added to OOXML was where a feature was previously only possible with the use of VML. Of these include adding comments to cells in spreadsheets. The comments themselves are stored in a “Comment part” of the OPC-package, but the display of the comment was done using VML. Luckily, at the BRM markup was added to allow DrawingML to be used instead.

So I used the spreadsheet from before and added a comment in cell B2 (the one with the crappy leap-year-bug result)

.

Anyone wanna guess if this comment is displayed using VML or DrawingML?

Yes, you guessed it … somewhere in the back of the document is a VML-fragment containing the “box” containing my comment.

[code:xml]<xml xmlns:v="urn:schemas-microsoft-com:vml"

 xmlns:o="urn:schemas-microsoft-com:office:office"
 xmlns:x="urn:schemas-microsoft-com:office:excel">
 <o:shapelayout v:ext="edit">
  <o:idmap v:ext="edit" data="1"/>

 </o:shapelayout><v:shapetype id="_x0000_t202" coordsize="21600,21600" o:spt="202"
  path="m,l,21600r21600,l21600,xe">
  <v:stroke joinstyle="miter"/>
  <v:path gradientshapeok="t" o:connecttype="rect"/>
 </v:shapetype><v:shape id="_x0000_s1025" type="#_x0000_t202" style='position:absolute;
  margin-left:120.75pt;margin-top:1.5pt;width:126pt;height:55.5pt;z-index:1;

  visibility:hidden;mso-wrap-style:tight' fillcolor="#ffffe1" o:insetmode="auto">
  <v:fill color2="#ffffe1"/>
  <v:shadow on="t" color="black" obscured="t"/>
  <v:path o:connecttype="none"/>

  <v:textbox style='mso-direction-alt:auto'>
   <div style='text-align:left'></div>
  </v:textbox>
  <x:ClientData ObjectType="Note">

   <x:MoveWithCells/>
   <x:SizeWithCells/>

   <x:Anchor>
    2, 15, 0, 2, 4, 55, 3, 16</x:Anchor>

   <x:AutoFill>False</x:AutoFill>
   <x:Row>0</x:Row>
   <x:Column>1</x:Column>
  </x:ClientData>
 </v:shape></xml>[/code]

If this was a tweet on Twitter, I’d finish it off using the tag #fail.

Document protection

And finally – what about document protection? “Document protection” is the mechanism in OOXML to allow applications to open documents in “read-only-mode”, to protect worksheets from editing etc. These are not really protecting the document as such, because a hash of the password is simply saved – but everything is still in clear text. Document protection was one of the areas where OOXML was fundamentally changed, allowing more robust algorithms to be used – and not only the weak “legacy-algorithm” previously supported by Office 2007.

So in Excel2010 I chose to protect the active sheet and looked at the markup again. Again, I was hoping to find some of the new markup, but again I was let down. Excel2010 still uses the weak algorithm and it still uses the legacy markup to persist it.

[code:xml]<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>

<worksheet>

  <sheetData/>
  <sheetProtection password="8985" sheet="1" objects="1" scenarios="1" />
</worksheet>[/code]

This is not only an annoyance for those not working on the Windows-platform – it is also an annoyance for those of us working with e.g. .Net development. If Excel2010 used SHA1, SHA256 or one of the other modern algorithms supported directly by .Net framework, we’d be able to use it straight out of the box. Now all of us still need to add additional code/assemblies to our code – simply to protect a whorksheet.

At the end …

I must admit that I am a bit disappointed. I know that barely a year has gone since the publication of ISO/IEC 29500, but I had expected more of the things we discussed at the BRM to be implemented. At the very least I was expecting some of the more “easy ones” to have been implemented – like the conformanceClass-attribute, VML in comments and document protection. It would have been a nice token of “good faith”.

Now, we are basically left with nothing. Do note, however, that the observations above are by no means a comprehensive test. These represent simply a few of the thoughts on the top of my head while writing this.

Next time: WordpressingML

PS: I have been trying to find a list of the stuff from the BRM that was implemented in Office 2010 (like ISO-8601 dates), but I have been unsuccessful. So if anyone can point me towards such a list, I'd be happy to include those results above.

Denmark votes "yes" on IS29500 COR1 and FPDAM1

I know it has been a couple of weeks, but I just wanted to share current development with you.

On September 7th (in Danish), the Danish mirror committee to ISO/IEC JTC1 SC34 met at Danish Standards in Charlottenlund. On the agenda was, amongst other things, processing of documents under ballot. The relevant documents to WG4 was these

As appointed expert from Danish Standards in WG4, I have been working hard with the other experts in WG4 on these papers and I have for each meeting in Denmark provided oversights to the mirror committee on the current work. The members of the Danish committee have access to the same set of papers that I have, so we have primarily been discussing the more controversial ones - like usage of ISO-8601 dates in transitional files, reintroducing ST_OnOff in transitional schemas and changing the namespace name for strict files. A couple of times Danish committee members have requested information on more "trivial stuff", and we have then discussed this.

At the meeting of September 7th, I gave a quick sporadic overview of the more tough parts of COR1 and AMD1 and no comments were presented. We talked a bit about general principles of the work in WG4, but that was basically that.

After this, Denmark (Danish Standards) approved the document sets for COR1 and AMD1.

Obviously I think this is great news and the chairman of the Danish committee expressed his appreciation of the work put into creating these files.

Danish Competition Authority suggests: Use ODF in public sector!

Get the information straight from the horse's mouth from DCA website. If you are not speaking Danish, Google will do a rough translation for you.

I'll update this article shortly ...

... oh ... and I almost forgot ... they suggested using OOXML as well.

Smile