invitation2wordml.xslt - Transforming XML to WordML (part of Office Open XML)

The third lesson on transforming XML deals with the target format WordML, an XML dialect introduced with MS Word 2003. With Office 2007, Microsoft introduced Office Open XML. The WordML shown in this article is still part of this new format; a major difference is that an Office Open XML document is a zip file which contains (among other files) the WordML file.

If you're familiar with MS Word, you know that this word processor has a lot of features. It's not surprising that this complexity shows up in a WordML file, so an XSLT program that generates WordML is more complex as well. But first, here are the steps to transform my invitation.xml to WordML:

  1. Download the sample XML file invitation.xml

  2. Download the XSLT program invitation2wordml.xslt (Download/View)

  3. Run the XSLT program on the sample data, for example:

    xalan.sh -IN ../invitation.xml -XSL invitation2wordml.xslt -OUT invitation-wordml.xml

You should get a WordML/XML file as the result of the transformation (invitation-wordml.xml). Open the result in the XML-enabled 2003 edition of MS Word, or take a look at the screenshot of the XML file viewed in Word 2003:

[Screenshot: invitation.wordml viewed in Word 2003]

Since WordML is more complex than, for example, XHTML, I won't describe the XSLT listing in detail. Instead, I will roughly describe the structure of a WordML file.

Below you see an empty WordML file. The root element is wordDocument. WordML elements are declared in the namespace http://schemas.microsoft.com/office/word/2003/wordml (prefix in my example: word). Furthermore, you can mix WordML elements with common MS Office elements. The namespace identifier for the latter is urn:schemas-microsoft-com:office:office (prefix in my example: o).

o:DocumentProperties contains elements which are common to MS Office documents, while word:docPr contains Word-specific properties. Both are optional.

<word:wordDocument xmlns:word="http://schemas.microsoft.com/office/word/2003/wordml" xmlns:xml="http://www.w3.org/XML/1998/namespace" xml:space="preserve">
<!-- Document properties, common to all Office documents -->
<o:DocumentProperties xmlns:o="urn:schemas-microsoft-com:office:office">
...
</o:DocumentProperties>
<!-- Word specific document properties -->
<word:docPr>
...
</word:docPr>
<word:styles>
<!-- Paragraph Styles -->
<word:style word:type="paragraph" word:styleId="..." word:default="off">
...
</word:style>
<word:style word:type="paragraph" word:styleId="StandardPara" word:default="on">
...
</word:style>
<!-- Character Styles -->
<word:style word:type="character" word:default="off" word:styleId="...">
...
</word:style>
</word:styles>
<word:body>
<!-- Paragraphs and other body elements -->
</word:body>
</word:wordDocument>

Styles that you want to apply to more than one paragraph are defined within the word:styles element. Word has two kinds of styles: paragraph styles and character styles. Both are represented in the XML file as a word:style element; the kind is selected by the word:type attribute.
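
The frame described above could be produced by a root template along the following lines. This is only a minimal sketch, not an excerpt from invitation2wordml.xslt: the display name "Standard Paragraph" and the catch-all text() template are my assumptions, and the optional property elements are simply left out.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:word="http://schemas.microsoft.com/office/word/2003/wordml">

  <xsl:output method="xml" encoding="UTF-8" indent="no"/>

  <!-- Root template: writes the fixed document frame exactly once. -->
  <xsl:template match="/">
    <word:wordDocument>
      <!-- xml:space is added with xsl:attribute on purpose: written as a
           literal attribute it would also disable whitespace stripping
           for this part of the stylesheet itself. -->
      <xsl:attribute name="xml:space">preserve</xsl:attribute>
      <word:styles>
        <!-- Default paragraph style; the display name is an assumption. -->
        <word:style word:type="paragraph" word:styleId="StandardPara" word:default="on">
          <word:name word:val="Standard Paragraph"/>
        </word:style>
      </word:styles>
      <word:body>
        <!-- Templates for the source elements are expected to emit
             word:p elements here. -->
        <xsl:apply-templates/>
      </word:body>
    </word:wordDocument>
  </xsl:template>

  <!-- Assumption: suppress text that no template handles, so the
       built-in rules do not copy bare text into word:body. -->
  <xsl:template match="text()"/>

</xsl:stylesheet>
```

Run on its own, this sketch yields a document with an empty body, because no template for the source elements is defined yet.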

The body carries paragraphs and other elements. A paragraph in WordML can look like this:

<word:p xmlns:word="http://schemas.microsoft.com/office/word/2003/wordml">
<word:pPr>
<word:pStyle word:val="event"></word:pStyle>
</word:pPr>
<word:r xmlns="http://schemas.microsoft.com/office/word/2003/wordml">
<word:t>
Birthday Party
</word:t>
<word:br></word:br>
</word:r>
</word:p>

A paragraph has some properties, contained within word:pPr. In this case it's just a paragraph style (defined elsewhere), referenced by a word:pStyle element. One or more so-called runs (word:r) hold the text of the paragraph, though not directly: the text is placed within another element called word:t. If you're already confused, it may please you that some elements are named as in HTML, for example the empty word:br, which generates a line break. Many more elements are allowed at this point of the file; you can control the formatting of the paragraph, adjust fonts, colors, sizes and so on. To reduce the complexity of the generated file, I use paragraph styles. The referenced paragraph style, named event, is defined like this:
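
To show how such a paragraph might come out of an XSLT program, here is a small sketch. The source element name event is hypothetical (the structure of invitation.xml is not shown in this article); only the WordML element names and the style id event are taken from the listing above.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:word="http://schemas.microsoft.com/office/word/2003/wordml">

  <!-- Hypothetical source element, e.g. <event>Birthday Party</event> -->
  <xsl:template match="event">
    <word:p>
      <!-- Paragraph properties: only a reference to the paragraph style -->
      <word:pPr>
        <word:pStyle word:val="event"/>
      </word:pPr>
      <!-- One run; the text itself goes into word:t, not into word:r -->
      <word:r>
        <!-- normalize-space() trims line breaks and indentation taken
             over from the source, which would otherwise be kept under
             xml:space="preserve" -->
        <word:t><xsl:value-of select="normalize-space(.)"/></word:t>
      </word:r>
    </word:p>
  </xsl:template>

</xsl:stylesheet>
```

Keeping the formatting out of the template and in the style definition is what keeps the generated paragraphs this short.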

<word:style xmlns:word="http://schemas.microsoft.com/office/word/2003/wordml" word:type="paragraph" word:styleId="event" word:default="off">
<word:basedOn word:val="StandardPara"></word:basedOn>
<word:name word:val="Event"></word:name>
<word:pPr>
<word:jc word:val="center"></word:jc>
<word:shd word:val="clear" word:color="auto" word:fill="66BB66"></word:shd>
<word:spacing word:before="100" word:before-autospacing="off" word:after="300" word:after-autospacing="off"></word:spacing>
</word:pPr>
<word:rPr>
<word:b></word:b>
<word:shadow word:val="on"></word:shadow>
<word:sz word:val="72"></word:sz>
<word:color word:val="ffffff"></word:color>
</word:rPr>
</word:style>

A style has a word:styleId that uniquely identifies it. It can be word:basedOn another paragraph style, which makes it easier to define derived styles. A derived style may differ from its base style in, for example, just the text alignment or the font size. The word:name of a style is displayed in Word's style selection menu. The main purpose of a paragraph style definition is to define the paragraph properties (word:pPr) and the run properties (word:rPr). For the paragraph, the event style defines centered justification, some shading and spacing; for the runs of event paragraphs, it defines a bold face (word:b), a shadow (word:shadow switched on), a size of 72 (word:sz counts in half-points, so this is 36 pt), and a white word:color.
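
Derived styles lend themselves to a small named template. The following is a sketch under assumptions: the template name derived-paragraph-style, its parameters, and the style id eventLeft are invented for illustration and do not come from invitation2wordml.xslt. The element order mirrors the event listing above.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:word="http://schemas.microsoft.com/office/word/2003/wordml">

  <!-- Hypothetical helper: writes a paragraph style that inherits from
       a base style and overrides only the alignment. -->
  <xsl:template name="derived-paragraph-style">
    <xsl:param name="id"/>
    <xsl:param name="display-name" select="$id"/>
    <xsl:param name="base" select="'StandardPara'"/>
    <xsl:param name="align" select="'left'"/>
    <word:style word:type="paragraph" word:styleId="{$id}" word:default="off">
      <word:basedOn word:val="{$base}"/>
      <word:name word:val="{$display-name}"/>
      <word:pPr>
        <word:jc word:val="{$align}"/>
      </word:pPr>
    </word:style>
  </xsl:template>

  <!-- Example call: a left-aligned variant of the event style.
       This emits only a word:styles fragment, not a complete document. -->
  <xsl:template match="/">
    <word:styles>
      <xsl:call-template name="derived-paragraph-style">
        <xsl:with-param name="id" select="'eventLeft'"/>
        <xsl:with-param name="display-name" select="'Event (left)'"/>
        <xsl:with-param name="base" select="'event'"/>
        <xsl:with-param name="align" select="'left'"/>
      </xsl:call-template>
    </word:styles>
  </xsl:template>

</xsl:stylesheet>
```

Everything the derived style does not mention (shading, spacing, bold face, size, color) should then be inherited from event through word:basedOn.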

Generating this kind of markup with XSLT is analogous to generating XHTML or XSL-FO, which I have already described. Therefore you should be able to understand my XSLT program. To generate more sophisticated WordML documents, you will certainly need more details about WordML. You can learn a little more by looking at my XSLT. The details of WordML are described in Microsoft's reference documentation. In case you run into problems with your XML processing, don't forget: Linkwerk accepts money for consulting, development and data conversion ;-). Send us a mail.

© 2005-2007, Stefan Mintert, All rights reserved

Legal notice: If you want to use my XSLT script, please read the Office 2003 XML Reference Schema Patent License. While I don't call my piece of code a product, I have to display the following note (cited from the linked page at the time of writing, 2005):

This product may incorporate intellectual property owned by Microsoft Corporation. The terms and conditions upon which Microsoft is licensing such intellectual property may be found at http://msdn.microsoft.com/library/en-us/odcXMLRef/html/odcXMLRefLegalNotice.asp.

Update 2007: With MS Office 2007 the situation has changed. Office Open XML is an open Ecma standard and is about to become an ISO standard. There shouldn't be any restrictions on using the format, but I'm not a lawyer. Read the copyright material from MS, Ecma or ISO and, if in doubt, ask your lawyer.