Validating OOXML documents

by jlundstocholm 1. April 2010 21:37

The question of “is this a valid document?” is tricky. At the end of the day it comes down to the conformance clauses of the specification for the document being considered. The conformance clauses of OOXML are listed in Section 2 of Part 1 and Section 2 of Part 4. There are also conformance clauses for Part 2 and Part 3, but they are not really relevant for this post.

The basic requirements for “document validity” can be summarized in these two points:

  • The markup must conform to the schemas of the specification
  • The markup must obey any semantic and syntactic constraints of the specification

The first bullet is the easy one to check, because all you need to do is validate the markup against the schemas. The second bullet is much harder, and it is almost impossible to perform such a validation automatically.

But since the first requirement is so easy to test, one could argue that, at the bare minimum, a document producer MUST be able to create documents that are valid against the corresponding schemas.

To be able to test this, I implemented an OOXML document schema validator.

It turned out that I should have done this from “day one” of my work with OOXML, because trying to implement a validator revealed a lot about how the document is structured and put together. That knowledge really comes in handy when trying to implement a document generator.

My approach was this:

  • Implement a tool to validate documents that is as generic as possible
  • Use the latest, approved version of the schemas
  • Implement a web front-end to allow anyone to use it from anywhere.
  • Open-source the stuff

Originally I based the validator on OpenXML SDK 2.0, but during the implementation I ran into three problems. First, it seemed a bit too “Microsoft Office dependent”. Second, I could not get access to all the information in the OPC package that I needed to validate, since the SDK hides some of it (and rightly so, if you ask me). Third, it turned out that I didn’t need the SDK at all: OpenXML SDK is based on System.IO.Packaging, and since I use .NET to implement this, I found System.IO.Packaging itself to be a much better tool for the job.

What does it do?

The validator performs these tasks:

  1. It checks if the media type (MIME type) of each part is listed in the specification
  2. It checks if the relationship type of each relationship in the relationship files is listed in the specification
  3. It checks if each part referenced exists in the package in the correct location
  4. It checks the content (markup) of each part against the transitional schemas of the specification
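
The third check can be illustrated with a minimal sketch. This is not the validator's actual code (which is written in .NET on top of System.IO.Packaging); it is a Python approximation, and the function names `rels_source_dir` and `find_missing_targets` are hypothetical. It assumes that relative targets are resolved against the folder of the part that owns the relationship file, and it skips details a real validator would need, such as percent-encoded targets and case-insensitive comparison of part names.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by OPC relationship files (*.rels)
RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def rels_source_dir(rels_name):
    """Return the folder of the part that owns a relationship file.

    '_rels/.rels' belongs to the package root (''), and
    'word/_rels/document.xml.rels' belongs to a part in 'word'.
    """
    return posixpath.dirname(posixpath.dirname(rels_name))


def find_missing_targets(package):
    """Return (rels_file, target) pairs whose internal target is not in the ZIP."""
    missing = []
    with zipfile.ZipFile(package) as zf:
        names = set(zf.namelist())
        for rels_name in sorted(n for n in names if n.endswith(".rels")):
            base = rels_source_dir(rels_name)
            root = ET.fromstring(zf.read(rels_name))
            for rel in root.findall(f"{{{RELS_NS}}}Relationship"):
                if rel.get("TargetMode") == "External":
                    continue  # external targets live outside the package
                target = rel.get("Target", "")
                if target.startswith("/"):
                    # absolute part name, relative to the package root
                    resolved = target.lstrip("/")
                else:
                    # relative to the folder of the owning part
                    resolved = posixpath.normpath(posixpath.join(base, target))
                if resolved not in names:
                    missing.append((rels_name, target))
    return missing
```

Calling `find_missing_targets("some-document.docx")` (a placeholder file name) returns an empty list when every internal relationship target is present in the package.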

What doesn’t it do?

The validator does not do the following things:

  • Support validation of documents containing extensions using MCE
  • Support files with extensions other than “docx”, “xlsx” or “pptx”
  • Support validation against the strict schemas of OOXML
  • Support validation of the “root” package entry, being the file [Content_Types].xml

Other tools:

As you probably know, Alex Brown has made the “office-o-tron”, which is a SAX/Java-based document validator. The differences between his tool and mine are summarized here:

Task                                             Office-o-tron   OOXML Validator
Validates OOXML documents                        x               x
Validates OOXML <T> documents                    x               x
Validates OOXML <S> documents                    -               -
Validates against ISO/IEC 29500:2008             x               -
Validates against ISO/IEC 29500:2008 COR 1       -               x
Supports MCE                                     -               -
Inspects package of document (ZIP container)     -               x
Validates ODF documents                          x               -

The only major difference (OOXML-wise) is really that office-o-tron validates against the core, base schemas of ISO/IEC 29500:2008, whereas the OOXML Validator validates against the set of schemas with the first set of approved corrigenda (COR1). Whenever a new set has been approved as either an amendment or a corrigendum, the schema sets will be updated accordingly.

I'll update this article with some of the details revealed during the creation of the validator - until then, have fun.

:)

Comments

4/2/2010 6:47:14 AM #

Jirka Kosek

Seems that it is only me who hasn't yet developed his own OOXML validator. ;) Any sponsor for this assignment? I have a few cool ideas and I will definitely base my validator on NVDL, RELAX NG and Schematron to have more fun.

Jirka Kosek Czech Republic |

4/4/2010 4:49:15 AM #

jlundstocholm

Hi Jirka,

“Seems that it is only me who hasn't yet developed his own OOXML validator”

Yes, at times it seems like the access ticket to any serious work in WG4 ;)

“I have a few cool ideas and I will definitely base my validator on NVDL, RELAX NG and Schematron to have more fun.”

That's just too typical for you WG1 guys to start talking about the holy trinity of WG1. I have played a bit with the idea of using Mono to enable RelaxNG validation and NVDL for MCE, but I need to figure out how to get the pieces together.

Also ... validating strict documents might be fun, or a status page with the most common errors identified across validations.

Something like:

Error XX: 100% of validations failed due to this
Error YY: 91% of validations failed due to this

...

:)

jlundstocholm Denmark |

4/6/2010 2:40:32 PM #

Rick Jelliffe

Jirka: The gap at the moment is the [XML listing of  ZIP parts in directories with XML files embedded] to allow link validation. You have mentioned your students have something similar already written: any chance of adjusting any loose ends to be suitable for use in a validator, the ISO ZIP project and making it available open source?

A .NET, a Linux C++, and a Java version would be nice too! We need better coverage.

Rick Jelliffe United States |

4/2/2010 8:11:59 AM #

Alex Brown

@Jirka

The more the better (and an RNG-based validator would be nice) - It would be even better if we could agree some kind of common reporting format so that validation results could be compared ...

Alex Brown United Kingdom |

4/6/2010 3:27:51 PM #

jlundstocholm

Hi Alex,

“It would be even better if we could agree some kind of common reporting format so that validation results could be compared ...”

Yes - but this kind of requires us to agree on how validation takes place in the first place. Your validator tests each part by running a series of tests for each part it encounters, whereas mine runs a series of tests for all parts. Agreeing on this would at least make the reports more comparable.

Also - we would need to figure out the different levels of error reporting. Mine currently has a bug in it that reports an error that really should be a warning. We should agree on those things as well.

(I tested this document at 6:26 this morning)

jlundstocholm Denmark |

4/3/2010 3:19:07 AM #

Miguel de Icaza

Hello Jesper,

Is your OOXML validator available anywhere? I would like to know if Mono is able to run it properly.

Miguel.

Miguel de Icaza United States |

4/3/2010 4:57:48 AM #

jlundstocholm

Hi Miguel,

Follow the link above to the 'About' page at http://29500.idippedut.dk/Home/About

:)

I am considering using Mono to use NVDL for MCE-processing or RelaxNG-validation so I'll possibly ping you later for guidance.

jlundstocholm Denmark |

7/9/2011 1:03:55 AM #

trackback

Office’s Support for ISO/IEC 29500 Strict

There has been some interest expressed lately regarding how soon Microsoft Office will offer full read

Doug Mahugh |
