Correct according to spec or implementation?

by jlundstocholm 22. April 2010 00:52

At the recent SC34 WG4 meeting in Stockholm, validators quickly became the talk of the town - so to speak. As I am sure you all know, Alex Brown made the office-o-tron some time ago - a validator targeting both ODF and OOXML in their ISO editions. A few weeks ago I made a validator myself - but mine only targets OOXML in its "latest-improved-and-approved-transitional-version". Alex Brown's is written in Java and mine is written in C#.

Anyway - both Alex and I had some lengthy discussions with Microsoft about our validators and the errors they report. The thing is - there is a bug in the OOXML specification dealing with how to specify the relationship type for the part containing document properties like "author", "created on", "last modified on" etc. This part is a "central" part in OOXML, and to the best of my knowledge, there is not a single implementation out there that doesn't use this part for storing these so-called "core properties".

If you have tried to validate an OOXML file in my validator, you have probably encountered this error:

Checking relationshiptype http://schemas.openxmlformats.org/package/2006/relationships/metadata/core-properties ...

RelationshipType http://schemas.openxmlformats.org/package/2006/relationships/metadata/core-properties is not valid. It should have been http://schemas.openxmlformats.org/officedocument/2006/relationships/metadata/core-properties.

In OOXML the "glue" tying the document and its various parts together is "relationship types". So for a given media type (content type), a relationship type has to be used to properly register it in the package. A few relationship types are defined for the common parts of OOXML documents, e.g. for wordprocessing files, for spreadsheets, for presentations, for headers, footers etc. Some of these are defined in Part 1, section 15, and this is where the bug is. It is obviously a typo, and it has already been included in our list of fixes for the next batch.
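To make the check concrete, here is a minimal sketch in Python of how a validator could look at the relationship types registered in a package. It is not the code of my validator or of Alex's; the function names are made up for illustration, and the two URIs are simply the ones from the error message above. It assumes the package is a ZIP file whose root relationships live in `_rels/.rels`.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespace of the relationships part (_rels/.rels) in a package.
RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"

# The two URIs quoted in the validator message above.
TYPE_IN_FILES = "http://schemas.openxmlformats.org/package/2006/relationships/metadata/core-properties"
TYPE_EXPECTED = "http://schemas.openxmlformats.org/officedocument/2006/relationships/metadata/core-properties"


def root_relationship_types(package):
    """Return the Type attribute of every relationship in _rels/.rels."""
    with zipfile.ZipFile(package) as zf:
        root = ET.fromstring(zf.read("_rels/.rels"))
    return [rel.get("Type", "") for rel in root.findall(f"{{{RELS_NS}}}Relationship")]


def check_core_properties(package):
    """Hypothetical check: flag a core-properties relationship whose type
    differs from the one the validator expects."""
    messages = []
    for rel_type in root_relationship_types(package):
        if rel_type.endswith("/metadata/core-properties") and rel_type != TYPE_EXPECTED:
            messages.append(
                f"RelationshipType {rel_type} is not valid. "
                f"It should have been {TYPE_EXPECTED}."
            )
    return messages


def build_sample_package(rel_type):
    """Build a tiny in-memory package containing only _rels/.rels (for demonstration)."""
    rels = (
        f'<Relationships xmlns="{RELS_NS}">'
        f'<Relationship Id="rId1" Type="{rel_type}" Target="docProps/core.xml"/>'
        "</Relationships>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("_rels/.rels", rels)
    buf.seek(0)
    return buf


if __name__ == "__main__":
    for message in check_core_properties(build_sample_package(TYPE_IN_FILES)):
        print(message)
```

A real validator would of course walk every relationships part in the package and check all the relationship types, not just this one.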

The trick is - this has rather drastic consequences, at least from a validation point of view, because a typo in this area affects almost every implementation of OOXML that persists these basic data-chunks.

The thing is ... each and every document created by Microsoft Office will likely fail due to this bug in the specification.

So what are you gonna do?

Well, we discussed several different approaches.

One was simply to correct my validator to not report this error. I don't really like this idea, since it opens a floodgate of other scenarios where small corrections should take place. Also, if I did want to go down that road, it would require a strategy for handling these things, since I wouldn't want to correct any one error based on what Microsoft Office does - being an IBM drone and all. As of yet, I haven't been able to come up with such a strategy.

A second was to report warnings instead of errors in areas where "known bugs" were already on our to-do list of future corrections. I am not sure I like this either, since it makes the validator almost impossible to maintain and it muddies the results: no one will be able to figure out if a warning was simply a warning or a "down-graded error".

A third option is to do nothing.

I like that.

If you have tried to validate the same document using my validator and Alex's, you have probably noticed that Alex's validator emits many more errors than mine. This is because I use the schemas with the first batch of corrections (the so-called COR1-set). I'll update the schemas whenever the next batch of corrections is approved by either SC34 or JTC1. Alex's validator uses the schemas that were in the originally approved version of ISO/IEC 29500:2008. So my validator is already pretty "graceful" as it is.

Another reason that I like the idea of "doing nothing" is that it emphasizes a crucial point: A document should be valid according to the spec and not according to whatever implementation one considers "reference". There are other standards out there where we have a strange mixture of behaviour defined in the specification and behaviour buried in a "reference implementation". I don't know about you - but I'd rather have the spec be "the truth" than several gigs of source-code from whatever implementation is the pet app-du-jour at the moment.

Additionally, this shows us that all the implementations that handle this have failed in terms of feeding their experiences back to the standardisation organisation maintaining the specification. They will all have encountered this issue - but failed to report it ... unless, of course:

  • they haven't looked in the spec at all [0]
  • they haven't bothered to validate their documents

The puzzling thing is - Alex and Gareth discovered this bug in January 2010 and Alex's validator has been reporting this error for months now. I guess the answer to why none of the implementers of OOXML has reported this bug is ... blowing in the wind.

So what I am trying to say is this: My validator stays the way it is - validating documents according to the spec. If any vendor discovers a problem that is clearly an error in the spec, they should prioritize notifying us about it so we can correct it (which we will).

 

 

[0] Truth be told, prioritizing "make it work with the most important implementation" is not unheard of. I myself, when I created my first ODF-files, didn't look in the ODF-spec. I reverse-engineered ODF-documents created by OOo since I only cared about whether OOo would eat it or not. Other implementations insist on not "supporting OOXML" but "supporting Microsoft Office output".

Comments

4/25/2010 7:49:19 PM #

Alex Brown

@Jesper

I completely agree that a validator should be a neutral reporter of problems wherever possible, and that we should not be going down the road (taken by some ODF validators) of special-casing certain applications' documents to downplay errors against the spec.

Strictly speaking, I think Office-o-tron is correct in validating OOXML documents against the published version of ISO/IEC 29500:2008. Although the first set of corrigenda has been approved, it is still probably a few months away from publication and so is not yet a bona fide part of the Standard. I do think it's of interest to users that our validators take this slightly different approach - but we will need to be very careful to make it clear what version(s) of the specs we target when we validate. One can only hope the office suite vendors raise their game and will be equally fastidious in this respect; currently we should be suspicious of implementations claiming vaguely to support "IS 29500", "ODF" or "ODF 1.2" ...

Alex Brown United Kingdom |

4/26/2010 4:48:07 AM #

jlundstocholm

Hi Alex,

Well - strictly (no pun intended) speaking myself, I think the question of whether the corrections have been published or not is mostly a matter of personal preference. Validation-wise it shouldn't matter that much.

currently we should be suspicious of implementations claiming vaguely to support "IS 29500", "ODF" or "ODF 1.2" ...

Yeah, but that is what vendors do. Most ODF-implementations today claim that they support "The ISO-standard" even though none of them can actually persist data in this format. Now that I think about it, it is very much like Microsoft Office 2010 and OOXML Strict. It loads the stuff just fine - but it cannot persist it. Given the turmoil of this lately, how's that for hypocrisy?

:)

I will see to it that my validator does a better job at displaying the correct version number.

PS: Look - nested comments ...

jlundstocholm Denmark |

4/26/2010 2:45:46 AM #

Ian Easson

Hello all.

Of course you have to validate against the written standard.  Having said that, however, there is a practical matter of the user not wanting to be bothered sometimes with such messages.

The practical solution is for validators to have an OPTIONAL command line parameter, allowing the user to tell it to ignore any such known errors in the standard that are already queued up for fixing.
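A minimal sketch of what such a switch could look like, in Python. The flag name `--ignore-known-defects`, the finding IDs and the contents of the known-defect list are all made up for illustration; neither validator discussed here is known to work this way.

```python
import argparse

# Hypothetical IDs of findings caused by defects in the standard that are
# already queued up for correction.
KNOWN_SPEC_DEFECTS = {"core-properties-relationship-type"}


def build_parser():
    """Command line with one optional switch; the default stays strict."""
    parser = argparse.ArgumentParser(description="Validate a document (sketch).")
    parser.add_argument("document", help="path of the file to validate")
    parser.add_argument(
        "--ignore-known-defects",
        action="store_true",
        help="suppress findings caused by known defects in the standard",
    )
    return parser


def filter_findings(findings, ignore_known_defects):
    """findings is a list of (finding_id, message) tuples."""
    if not ignore_known_defects:
        return list(findings)
    return [f for f in findings if f[0] not in KNOWN_SPEC_DEFECTS]


if __name__ == "__main__":
    args = build_parser().parse_args()
    # A real validator would produce these; hard-coded here for the sketch.
    findings = [
        ("core-properties-relationship-type", "RelationshipType is not valid."),
        ("schema-error", "Unexpected element."),
    ]
    for finding_id, message in filter_findings(findings, args.ignore_known_defects):
        print(f"{finding_id}: {message}")
```

Because the switch is off by default, the validator still validates against the written standard unless the user explicitly asks otherwise.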

Ian Easson Canada |

4/26/2010 4:50:04 AM #

jlundstocholm

Hi Ian,

The practical solution is for validators to have an OPTIONAL command line parameter, allowing the user to tell it to ignore any such known errors in the standard that are already queued up for fixing.


I think the implementation of this will need some tweaking, but I like the idea of presenting the user with some options for validation. Alex already has this in his validator, where some options can be checked before validation.

:)

jlundstocholm Denmark |

4/27/2010 1:42:55 PM #

Rick Jelliffe

The more a validation message can indicate the following, the more workable it will be:

1) What it is in clear language
2) Why does it matter? (what is the business rule or context?)
3) Why it is an error (who says? against which standard or dialect, exactly)
4) What its extent is (will this cause products to fail?  is it just naughty?)
5) What to do about it (repair? ignore? SMS Steve Jobs?)
6) What kinds of person are interested in this (programmers?  end-users?)
7) In what situations is this actually an issue?

While we could imagine better conventions for these, our current crop of grammar-based schema languages are still utterly incapable of representing even *one* of these.

The thing is, why should we be expecting that these complex systems of documents can be kept under control when our schema technology does not support ideas such as "dialect" or "Postel's law" or "graceful degradation" as first class citizens?

For example, in Jesper's case here, why cannot there be a way to mark the incorrect relationship type in the schema as being a known error?

Rick Jelliffe United States |

4/27/2010 11:29:04 PM #

Rob Weir

Do you want to be right or be useful?  We only need one validator to be right.  But we sure could use some useful ones.

It is interesting to look at other programmers' proofing tools, from the old lint to codecheck to compiler errors.  The lessons we've learned over the years are:

1) It is best to have your code compile "clean", i.e., with no errors or warnings.  That way any new error stands out immediately and catches your attention.

2) However, #1 is not always possible, perhaps because you are reusing a library that gives warnings, or for some other reason there is an expected level of "background noise" that is not under your immediate control.

3) So you are in a position where the "background noise" makes it hard to detect new and unexpected errors.   What do you do?  Disable all errors of that class?  No, that is overkill, since you would then miss new errors of that type that are unexpected.

4) Good solutions over the years include: a) specifying an error count by type.  So expect (and ignore) 5 warnings of this type and report if more than that occur.   b) Better yet, specify the location of the expected errors (by XPath perhaps) and report any that go beyond that.  c) Use inline directives in the source, to disable/enable warnings.  Microsoft did that with "pragma" directives in their languages.  Could be done with processing instructions in XML.
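Options a) and b) can be sketched in a few lines of Python. This is only an illustration under assumed conventions: a finding is taken to be a (type, location) pair, and the function names, the error type names and the location strings are invented for the example.

```python
from collections import Counter


def unexpected_by_count(findings, expected_counts):
    """Option a): return {type: excess} for every error type that occurs
    more often than the baseline says to expect."""
    actual = Counter(kind for kind, _location in findings)
    return {
        kind: count - expected_counts.get(kind, 0)
        for kind, count in actual.items()
        if count > expected_counts.get(kind, 0)
    }


def unexpected_by_location(findings, expected_locations):
    """Option b): return the findings whose (type, location) pair is not
    listed in the baseline of expected errors."""
    return [finding for finding in findings if finding not in expected_locations]


if __name__ == "__main__":
    findings = [
        ("bad-relationship-type", "/_rels/.rels#rId1"),
        ("bad-relationship-type", "/_rels/.rels#rId2"),
        ("schema-error", "/word/document.xml"),
    ]
    # Expect one relationship-type error as background noise.
    print(unexpected_by_count(findings, {"bad-relationship-type": 1}))
    print(unexpected_by_location(findings, {("bad-relationship-type", "/_rels/.rels#rId1")}))
```

The location-based baseline is the stricter of the two: a new error of an already expected type in a different place is still reported.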

Rob Weir United States |

5/31/2010 6:36:51 AM #

jlundstocholm

Hi Rob,

Do you want to be right or be useful?  We only need one validator to be right.  But we sure could use some useful ones.

Isn't that another way of saying: "Be strict - but do a better job at communicating what you find"?

I like the idea of grouping errors - and I think I for sure need to empower the user to decide if all warnings should be ignored, reported or grouped ... or a combination thereof.

Thanks for your feedback :)

jlundstocholm Denmark |

5/31/2010 10:07:01 AM #

Rob Weir

More like, "It is better to have all interested parties work together on a single feature-rich validator than to have four sucky ones".

Rob Weir United States |

5/31/2010 5:24:04 PM #

jlundstocholm

Hi Rob,

More like, "It is better to have all interested parties work together on a single feature-rich validator than to have four sucky ones".

I couldn't agree more :)

jlundstocholm Denmark |

4/28/2010 8:16:23 AM #

Ian Easson

This is actually turning out to be a useful discussion.

I support Rob's idea of looking to the decades of experience with source code analyzers, so that the user of the validator can have practical control over the types of warnings and error messages that result.  The user of a validator can have a wide range of objectives.  Any other approach amounts to the validator assuming what the user's objectives are, which is arrogant.

Ian Easson Canada |

4/28/2010 5:04:11 PM #

Alex Brown

@Ian @Rob

There is plenty of experience of XML validation management out there, it's just that it has not been applied to document format validation yet.

The "best practice" approach, IMHO, is to assign each constraint an ID and then hang a load of metadata off that (error message - in different languages; URL for further info; severity level; metadata about which versions the constraint applies to, etc.). Validation reports themselves should then be in XML -- this enables all kinds of useful styling info, such as filtering and grouping errors.
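As a rough illustration of that idea (in plain Python rather than Schematron, and not taken from any existing validator), each constraint below carries an ID plus metadata, and the report is built as XML so it can be filtered afterwards. The constraint ID, the URL, the version labels and the element names are all placeholders.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Constraint:
    id: str
    severity: str
    messages: dict      # language code -> message text
    info_url: str       # placeholder URL for further information
    applies_to: tuple   # spec versions the constraint applies to


# Hypothetical registry with a single constraint.
CONSTRAINTS = {
    "C001": Constraint(
        id="C001",
        severity="error",
        messages={
            "en": "Relationship type is not valid.",
            "da": "Relationstypen er ikke gyldig.",
        },
        info_url="https://example.org/constraints/C001",
        applies_to=("ISO/IEC 29500:2008", "COR1"),
    ),
}


def build_report(hits, lang="en"):
    """hits is a list of (constraint_id, location) tuples; returns an XML element."""
    report = ET.Element("report")
    for constraint_id, location in hits:
        constraint = CONSTRAINTS[constraint_id]
        finding = ET.SubElement(
            report,
            "finding",
            {
                "constraint": constraint.id,
                "severity": constraint.severity,
                "location": location,
                "info": constraint.info_url,
            },
        )
        # Fall back to English when no message exists for the requested language.
        finding.text = constraint.messages.get(lang, constraint.messages["en"])
    return report


if __name__ == "__main__":
    report = build_report([("C001", "/_rels/.rels")], lang="da")
    print(ET.tostring(report, encoding="unicode"))
    # Filtering by severity is then a simple query on the report.
    print(len(report.findall("finding[@severity='error']")))
```

With the report in XML, grouping, filtering and styling become a matter of post-processing rather than something baked into the validator.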

The right tool for the job is Schematron, and there is work taking place (both inside and outside SC 34) to make progress in this area. Stand by!

Alex Brown United Kingdom |
