

Creating Content with Word 2003 and XML
By Peter Vogel
For developers, what’s really important about Word 2003 are the changes to the Word object model that let you access your Word document as XML. Peter Vogel shows how to use these new XML features.
Up until Word 2003, the document’s content has been held in a proprietary format. Microsoft Word 2003 now also represents documents in an open, fully documented XML format. That’s useful and interesting but, for developers, what’s really important are the changes to the Word object model that let you take advantage of that XML representation of the document. You can, for instance, use XSLT to transform any part of your Word document.

Working with XML
You can use Word to load an XML schema and create a document by adding elements and attributes from the schema. However, I’ll begin by concentrating on a more typical scenario: Using Word to create a document like a report, a memo, or a letter. With Word 2003 you can have your code work with that Word document as an XML document, written in the WordML dialect.
By saving a Word document in XML format, you can see the WordML tags that make up the document. The absolutely simplest WordML document looks like this:
<?xml version='1.0'?>
<w:wordDocument
xmlns:w='http://schemas.microsoft.com/office/word/2003/2/wordml'>
<w:body>
<w:p>
<w:r>
<w:t>Hello, World.</w:t>
</w:r>
</w:p>
</w:body>
</w:wordDocument>
The wordDocument element contains everything needed for a Word document. Within the wordDocument element, the body element holds the document content. My example shows the paragraph (p) tags, the run of text (r) tags, and the text (t) tags that hold the text in the document. The WordML tags are flagged with the prefix ‘w’ that ties the WordML tags back to the wordml namespace defined with the xmlns attribute on the wordDocument element (my example shows the namespace used in Office Beta 2, which may change by the official release).
But you don’t have to save a document to access its XML representation. You can get to the XML text of your document through the XML property of Word’s Range and Selection objects. As an example, this sample document contains two paragraphs:
Now is the time for every good man to come to the aid of their party.
Time flies like an arrow; fruit flies like a banana.
To retrieve the XML of the second paragraph, you could use this code:
Dim strXML As String
strXML = ActiveDocument.Paragraphs(2).Range.XML
You’ll get a lot of text back from the XML property—over 3,000 characters for a typical Word document. While my code asked for just the second paragraph, the XML property returns a complete WordML document with everything that could affect the text: style definitions, information about fonts, page size, margins, and more.
At this point you can pass your document to any XML tool for processing. For instance, this code loads the returned XML into a DOM parser for processing:
Dim dom As DOMDocument
Set dom = New DOMDocument
dom.loadXML ActiveDocument.Paragraphs(2).Range.XML
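Once the XML is in a DOM, ordinary XPath queries can pick out the pieces you care about. The following is only a sketch of one way to do that: the procedure name ShowParagraphText is hypothetical, the late-bound "MSXML2.DOMDocument" object is assumed to be installed, and the WordML namespace is read from the root element rather than hard-coded, because the namespace URI may differ between builds.

```vba
Sub ShowParagraphText()
    ' Late binding, so no project reference to the Microsoft XML library is needed.
    Dim dom As Object
    Set dom = CreateObject("MSXML2.DOMDocument")
    dom.async = False

    ' loadXML returns False when the string cannot be parsed.
    If Not dom.loadXML(ActiveDocument.Paragraphs(2).Range.XML) Then
        MsgBox "Could not parse the WordML: " & dom.parseError.reason
        Exit Sub
    End If

    ' Map the w prefix to whatever namespace the returned root element uses.
    dom.setProperty "SelectionLanguage", "XPath"
    dom.setProperty "SelectionNamespaces", _
        "xmlns:w='" & dom.documentElement.namespaceURI & "'"

    ' Print the contents of each text (t) element found inside the body.
    Dim nd As Object
    For Each nd In dom.selectNodes("//w:body//w:t")
        Debug.Print nd.Text
    Next
End Sub
```

Restricting the query to the body element skips the style and font information that the XML property also returns.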

Changing the Document
For developers, this technology wouldn’t be interesting unless you could update the document. In the new Word object model, you can change your content using XML and the InsertXML method of the Range and Selection objects. The InsertXML method lets you insert arbitrary strings of XML into your document. While it seems counter-intuitive, even if you’re just updating a single paragraph, you must insert a complete WordML document. The Selection or Range object that you call the method on controls which part of your document actually gets updated. This code will, for instance, replace the currently selected text with my e-mail address:
Application.Selection.InsertXML _
    "<?xml version='1.0'?><w:wordDocument " & _
    "xmlns:w='http://schemas.microsoft.com/office/word/2003/2/wordml'>" & _
    "<w:body><w:p><w:r><w:t>peter.vogel@phvis.com.</w:t>" & _
    "</w:r></w:p></w:body></w:wordDocument>"
Using the InsertXML method can generate the message “XML markup cannot be inserted in the specified location.” More often than not, this message means that your XML isn’t well formed, not that you’re inserting it into the wrong place.
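Because ill-formed XML is the usual cause of that message, it can help to build the wrapper document in one place and escape the text you put inside it. This is a minimal sketch, not part of the Word object model: EscapeXml, WrapInWordML, and ReplaceSelectionWithText are hypothetical helper names, and the WORDML_NS constant simply repeats the Beta 2 namespace used in this article, so it is an assumption that must match your build of Word.

```vba
' Namespace copied from the article's Beta 2 examples; adjust it for your build.
Const WORDML_NS As String = _
    "http://schemas.microsoft.com/office/word/2003/2/wordml"

' Replace the characters that would otherwise break well-formedness.
Function EscapeXml(ByVal s As String) As String
    s = Replace(s, "&", "&amp;")   ' must run first so later entities survive
    s = Replace(s, "<", "&lt;")
    s = Replace(s, ">", "&gt;")
    EscapeXml = s
End Function

' Wrap plain text in the smallest complete WordML document.
Function WrapInWordML(ByVal plainText As String) As String
    WrapInWordML = "<?xml version='1.0'?>" & _
        "<w:wordDocument xmlns:w='" & WORDML_NS & "'>" & _
        "<w:body><w:p><w:r><w:t>" & EscapeXml(plainText) & _
        "</w:t></w:r></w:p></w:body></w:wordDocument>"
End Function

Sub ReplaceSelectionWithText()
    On Error GoTo InsertFailed
    Selection.InsertXML WrapInWordML("Smith & Jones <sales>")
    Exit Sub
InsertFailed:
    MsgBox "InsertXML failed: " & Err.Description
End Sub
```

Without the escaping step, the ampersand and angle brackets in the sample text would make the string ill-formed.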
The real power in the InsertXML method is in the method’s second parameter, which accepts the pathname of a file containing XSLT code. When you supply that pathname, the InsertXML method processes the XML in the first parameter using the XSLT in the file and inserts the result into the document.
For instance, this WordML body tag contains my two sample paragraphs:
<w:body>
<w:p><w:r><w:t>
Now is the time for every good man to come to the aid of their party.
</w:t></w:r></w:p>
<w:p><w:r><w:t>
Time flies like an arrow; fruit flies like a banana.
</w:t></w:r></w:p>
</w:body>
The following XSLT code finds the text in the selected range and replaces it with a more meaningful statement:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
xmlns:w='http://schemas.microsoft.com/office/word/2003/2/wordml'>
<xsl:template match="/">
<xsl:for-each select="//w:t">
<w:wordDocument><w:body><w:p><w:r><w:t>
A stopped clock is right twice a day.
</w:t></w:r></w:p></w:body></w:wordDocument>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
With the stylesheet saved in a file called “BetterCliche.xsl”, I can update my document with this code:
ActiveDocument.Paragraphs(2).Range.InsertXML _
ActiveDocument.Paragraphs(2).Range.XML, "c:\BetterCliche.xsl"
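If the stylesheet file might be missing, the same call can be wrapped in a small guard. This is only a sketch: the procedure name ApplyBetterCliche is hypothetical, and the path is the one assumed in the example above.

```vba
Sub ApplyBetterCliche()
    Const XSLT_PATH As String = "c:\BetterCliche.xsl"

    ' Dir returns an empty string when no file matches the path.
    If Len(Dir(XSLT_PATH)) = 0 Then
        MsgBox "Stylesheet not found: " & XSLT_PATH
        Exit Sub
    End If

    ' Read the paragraph's XML, transform it, and write the result
    ' back over the same range.
    Dim rng As Range
    Set rng = ActiveDocument.Paragraphs(2).Range
    rng.InsertXML rng.XML, XSLT_PATH
End Sub
```

Holding the Range in a variable makes it explicit that the XML is read from and written back to the same part of the document.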

Schema Management
The second scenario for working with Word is to load an XML schema and let users build a document by adding XML elements and attributes from the schema. This ability to create (or update) XML documents allows you to integrate Word into XML-based workflows and applications.
Only the professional version of Word 2003 supports using schemas, so the first step in your code should be to check the Application object’s ArbitraryXMLSupportAvailable property. If that property returns True, you can add schemas to the XMLNamespaces collection of the Application object. The Add method for this collection must be passed the pathname to the file containing the schema, a namespace for the schema (an arbitrary string of characters used to distinguish tags with the same name in two different schemas), and a user-friendly name. In this example, the schema for the DocBook XML dialect is added with a namespace of dcb and a friendly name of ‘DocBook’ (which is what Word will display in the XML task pane):
If Application.ArbitraryXMLSupportAvailable = True Then
Application.XMLNamespaces.Add "c:\Docbook.xsd", "dcb", "DocBook"
End If
The namespace becomes important when you want to retrieve specific nodes from the document, as you’ll see later in this article.
Adding a schema to the Application’s XMLNamespaces collection doesn’t make the schema available to be used in any particular document. For that you must use the AttachToDocument method of the schema you added. This code passes the current document to the first schema in the XMLNamespaces collection to let that document use the schema:
Application.XMLNamespaces(1).AttachToDocument ActiveDocument
The result appears in Figure 1, which shows Word with the DocBook schema loaded.
**Insert Figure 1 WordML_Fig01.jpg**
Figure 1. Creating an XML document in Word using the DocBook schema.
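The check, the Add call, and the attach step can be combined so that the code doesn’t depend on the schema’s position in the XMLNamespaces collection. This is a sketch under the same assumptions as the snippets above (the c:\Docbook.xsd path, the dcb namespace, and the DocBook alias); the procedure name AttachDocBookSchema is hypothetical. It relies on the Add method returning the XMLNamespace object that it creates.

```vba
Sub AttachDocBookSchema()
    ' Schema support is not present in every edition of Word 2003.
    If Not Application.ArbitraryXMLSupportAvailable Then
        MsgBox "This edition of Word does not support custom schemas."
        Exit Sub
    End If

    ' Keep the returned XMLNamespace instead of looking it up by index.
    Dim ns As XMLNamespace
    Set ns = Application.XMLNamespaces.Add( _
        "c:\Docbook.xsd", "dcb", "DocBook")

    ' Make the schema available to the current document.
    ns.AttachToDocument ActiveDocument
End Sub
```

Using the returned object avoids attaching the wrong schema when other schemas are already in the collection.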
There is a shortcut available to you: Just add the schema to the XMLSchemaReferences collection of a Document rather than to the Application (the schema will also be added to the Application’s XMLNamespaces collection). The parameters of this collection’s Add method come in a different order from those of the XMLNamespaces collection’s Add method: the namespace is the first parameter, the user-friendly name is the second parameter, and the pathname to the schema comes last:
ActiveDocument.XMLSchemaReferences.Add "dcb", "DocBook", _
    "C:\schemas\DocBook.xsd"

Accessing XML Content
In documents created with a schema, you can still access your XML content through the XML property. The XML that you retrieve will contain a mixture of the nodes from the XML schema that the user is working with and WordML elements (the WordML tags control the appearance of the document in Word). However, you can limit the XML returned to just the non-WordML tags by passing True to the XML property. This code, for instance, will return the DocBook elements in my “Word plus DocBook” document:
Dim strDocBookText As String
strDocBookText = ActiveDocument.Paragraphs(1).Range.XML(True)
The DocBook tags returned from the combined document might look like this:
<?xml version="1.0" standalone="no"?>
<book>
<title>Creating Content</title>
</book>
This is actually a simplified representation, as I’ll reveal shortly.
The Range, Selection, and Document objects all have an XMLNodes collection that contains XMLNode objects that let you access the schema elements embedded in your document. These objects give you some of the same access to the schema elements that you would get by loading the XML into a DOM parser.
In my sample DocBook document there are two nodes: book and title. I can retrieve the title node with this code:
Dim nd As XMLNode
Set nd = ActiveDocument.XMLNodes(2)
Now that my nd variable points to the title node, I can retrieve the node’s data with the node’s Text property. This code will display the text “Creating Content” that is nested within the title element inside the book element:
MsgBox nd.Text
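Rather than relying on a node’s position in the collection, you can walk the whole XMLNodes collection and inspect each node. The following is a minimal sketch; the procedure name ListSchemaNodes is hypothetical, and it assumes the active document already has a schema attached and elements applied.

```vba
Sub ListSchemaNodes()
    Dim nd As XMLNode

    ' Each XMLNode exposes its element name, its namespace, and its text.
    For Each nd In ActiveDocument.XMLNodes
        Debug.Print nd.BaseName, nd.NamespaceURI, nd.Text
    Next
End Sub
```

The output goes to the Immediate window of the Visual Basic editor, which is convenient for checking which index a particular element has.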

Retrieving Nodes
More often than not, you will want to retrieve and process only some of the nodes in a document. The Document object’s SelectNodes method lets you use an XPath statement to find matching nodes and return them in an XMLNodes collection. There is one trick to using the SelectNodes method, though, which involves the namespace that you used when adding the schema to Word.
To keep the elements of the schema that you added separate from the WordML elements, Word adds a prefix to each element in the added schema and ties it back to the namespace for the schema. In terms of XML, the full DocBook document (as opposed to the simplified version that I showed earlier) looks like this:
<ns0:book xmlns:ns0="dcb">
<ns0:title>Creating Content</ns0:title>
</ns0:book>
The ‘ns0’ that appears in this sample is the prefix that ties the DocBook elements back to the dcb namespace.
Understanding the namespace is important because, when searching for nodes, you must specify both the prefix (as part of the element name) and the relationship between the prefix and the namespace. Code to retrieve all the title elements in a DocBook document would look like this:
Dim nds As XMLNodes
Dim nd As XMLNode
Set nds = ActiveDocument.SelectNodes( _
    "//ns0:title", "xmlns:ns0='dcb'")
For Each nd In nds
    MsgBox "Value for title tags: " & nd.Text
Next
As you can see, the SelectNodes method takes two parameters. The first is the XPath statement that finds all the title elements (the ‘//’ tells SelectNodes to begin at the root element and find all the title elements, no matter how deeply nested). The second parameter establishes the relationship between the prefix and the dcb namespace using the same syntax as you would in an XML document.
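A mismatch between the prefix mapping and the namespace that was used when the schema was added is easy to miss, so it can be worth checking the size of the returned collection before using it. This is a sketch only: the procedure name ShowTitles is hypothetical, and the dcb namespace is the one assumed throughout this article.

```vba
Sub ShowTitles()
    Dim nds As XMLNodes
    Set nds = ActiveDocument.SelectNodes( _
        "//ns0:title", "xmlns:ns0='dcb'")

    ' An empty collection may mean that the prefix mapping does not
    ' match the schema's namespace.
    If nds.Count = 0 Then
        MsgBox "No title elements found; check the prefix mapping."
        Exit Sub
    End If

    ' The collection is 1-based, like other Word collections.
    Dim i As Long
    For i = 1 To nds.Count
        Debug.Print i, nds(i).Text
    Next
End Sub
```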
Word 2003’s new XML-based technology lets you get back to the Word objects that you’re used to. For any node, you can retrieve the Word Range object associated with it:
Dim rng As Range
Set rng = nd.Range
Which means, of course, that you can update the document by using the InsertXML method of the Range object:
ActiveDocument.XMLNodes(2).Range.InsertXML _
    "<ns0:title xmlns:ns0='dcb'>Word 2003 XML</ns0:title>"

Saving Your Document
When it comes time to save your document, you still use the Save method. However, if you’re working with an XML schema, you will probably want to save just your embedded XML. For that, you set the Document object’s XMLSaveDataOnly property to True before calling the Save method:
ActiveDocument.XMLSaveDataOnly = True
ActiveDocument.Save
You can also do one final transformation of your document’s content by setting XMLUseXSLTWhenSaving to True and the XMLSaveThroughXSLT property to the path name of the file containing the XSLT code. This code, for instance, saves only the added schema-related tags using the ConvertToHTML.XSL stylesheet:
ActiveDocument.XMLSaveThroughXSLT = _
    "c:\Transforms\ConvertToHTML.XSL"
ActiveDocument.XMLUseXSLTWhenSaving = True
ActiveDocument.Save
Regardless of which method you use, the saved file loses all the WordML tags from the combined document. So, after saving the XML document, you’ll probably want to save the full document by turning off the XML-only options:
ActiveDocument.XMLSaveDataOnly = False
ActiveDocument.XMLUseXSLTWhenSaving = False
ActiveDocument.Save
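Because a forgotten XML-only setting would silently affect later saves, one option is to restore the previous setting in the same routine that changes it. The following is a sketch of that idea; the procedure name SaveDataOnly is hypothetical, and it covers only the XMLSaveDataOnly property.

```vba
Sub SaveDataOnly()
    Dim doc As Document
    Set doc = ActiveDocument

    ' Remember the current setting so that it can be put back afterwards.
    Dim oldDataOnly As Boolean
    oldDataOnly = doc.XMLSaveDataOnly

    On Error GoTo Restore
    doc.XMLSaveDataOnly = True
    doc.Save

Restore:
    ' Capture any error before running further statements.
    Dim errNum As Long
    Dim errMsg As String
    errNum = Err.Number
    errMsg = Err.Description

    ' Restore the setting whether or not the save succeeded.
    doc.XMLSaveDataOnly = oldDataOnly
    If errNum <> 0 Then MsgBox "Save failed: " & errMsg
End Sub
```

The routine restores the setting but does not save the full document again; that second save remains a separate step.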
With Word 2003, Word has acquired a rich set of functionality by integrating XML into its object model. With this new power you can transform document content and integrate Word into XML processing in ways not possible before.
