SharePoint Development from a Documentation Perspective
For these four entries, I’m going to go over in detail how to construct and register a custom parser that enables you to promote and demote properties between your custom file types and Windows SharePoint Services.
Read part one here.
Read part two here.
Read part three here.
Today, I’ll round out the document parser information I’m presenting by talking about how to register your custom parser with WSS. I’ll also give you a quick overview of the ISPDocumentParser interface, which your parser needs to implement to communicate with WSS.
To register a custom document parser with WSS, you must add a node to the document parser definition file that identifies your parser and the file type or types it can parse.
You can specify the file type or types a document parser can parse either by file extension, or file type program ID.
WSS stores the document parser definition file, DOCPARSE.XML, at the following location:
Web Server Extensions\12\CONFIG\DOCPARSE.XML
The document parser definition schema is as follows:
Following is a list of the elements in the document parser definition schema.
Required. Represents the root element of the document parser definition schema.
Required. Each docParser element represents a document parser and its associated file type. This element contains the following attributes:
· Name Required string. The file type associated with the parser. For docParser elements within the ByExtension element, set the Name attribute to the file extension. For docParser elements within the ByProdId element, set the Name attribute to the program Id of the file type. To associate a parser with multiple file types, add a docParser element for each file type.
· ProgId Required string. The program ID of the parser. This represents the ‘friendly name’ of the parser. This enables you to upgrade a parser without having to edit its document parser definition entry in the DOCPARSE.XML file. However, this prevents you from installing different versions of the same parser side-by-side.
Below is an example of a document parser definition file.
<docParser name="abc" ProgId="AdventureWorks.AWDocumentParser.ABCParser"/>
<DocParser name="AWApplication.Document" ProgId="AdventureWorks.AWDocumentParser.ABCParser"/>
In order for a custom document parser to perform document property promotion and demotion in WSS, it must implement the following document parser interfaces. These interfaces enable the document parser to be invoked by WSS, and send and receive document properties when so invoked.
Represents a custom document parser. This class includes the methods WSS uses to invoke the document parser.
Represents a property bag object used to transmit document properties between the document parser and WSS. Includes methods that enable the document parser to access the content type and document library schemas for the specified document.
Represents an enumerable document property collection. Includes methods the document parser can use to enumerate through the document property collection.
Represents a single document property. Includes methods for the document parser to get and set the document property value and data type.
When WSS invokes a document parser to promote document properties, the parser writes all document properties to a property bag object. WSS then determines which of these document properties to promote to matching document library columns. If the document has been assigned a content type, then WSS promotes the document properties that match the columns included in the content type.
Using the document parser interface, document parsers can access the content type assigned to the document, and, if desired, store the content type in the document itself. If, when WSS invokes the parser to parser a document, the parser writes the document’s content type to the property bag object as a document property, WSS compares this content type to the content types associated with the document library to which the document is being uploaded. If the document’s content type is one that is associated with the document library, WSS promotes the appropriate document properties and saves the document.
However, there are cases where the document’s content type may not actually be associated with the document library to which the user is uploading the document. For example, the user might have created the document from a document template that contained a content type; or the user might move a document from one document library to another.
If the document’s content type is not associated with the document library, WSS takes the following actions:
· If the document contains a document property for content type, but that document property is empty, WSS invokes the parser to demote the default list content type for the document library into the document. WSS then promotes the document properties that match columns in the default list content type, and stores the document.
This would occur if the document had not yet been assigned a content type.
· If the document is assigned a content type not associated with the document library, WSS determines whether the document library allows any content type. If so, WSS leaves the document’s content type as is. WSS does not promote the document content type; however, it does promote any document properties that match document library columns.
Lists can be set to allow any content type. To do this, add the Unknown Document Type content type to the list. If you add this content type to a list, then documents of any content type can be uploaded to the list without having their content types overwritten. This enables users to move a document to the list without losing the document’s metadata, as would happen if the content type was overwritten.
· If the document is assigned a content type not associated with the document library, and the document library does not allow any content types, WSS invokes the parser to demote the default list content type for the document library into the document. WSS then promotes the document properties that match columns in the default list content type, and stores the document.
The figure below details the actions taken by WSS, if the parser includes the document’s content type as a document property in the property bag returned to WSS when the parser parses a document.
WSS never promotes a document’s content type onto a document library.
In my final post of this series, I'll introduce you to the document parser definition schema, and the document parser interface.
In my last entry, I gave you a brief overview of what document parsers are in Windows SharePoint Services V3, and a high-level look at what you need to do to build a custom document parser for your own custom file types. Today we’re going to start digging a little deeper, and examine how a parser interacts with WSS in detail.
When a file is uploaded, or move or copied to a document library, WSS determines if a parser is associated with the document's file type. If one is, WSS invokes the parser, passing it the document to be parsed, and a property bag object. The parser extracts the properties and matching property values from the document, and adds them to the property bag object. The parser extracts all the properties from the document.
WSS accesses the document property bag and determines which properties match the columns for the document. It then promotes those properties, or writes the document property value to the matching document library column. WSS only promotes the properties that match columns applicable to the document. The columns applicable to a document are specified by:
· The document's content type, if it has been assigned a content type.
· The columns in the document library, if the document does not have a content type.
WSS also stores the entire document property collection in a hash table; this hash table can be accessed programmatically by using the SPFile.Properties properties. There is no user interface to access the document properties hash table.
The following figure illustrates the document parsing process. In it, the parser extracts document properties from the document and writes them to the property bag. Of the four document properties, three are included in the document's content type. WSS promotes these properties to the document library; that is, writes their property values to the appropriate columns. WSS does not promote the fourth document property, Status, even though the document library includes a matching column. This is because the document's content type does not include the Status column. WSS also writes all four document properties to a hash table that is stored with the document on the document library.
WSS can also invoke the parser to demote properties, or write a column value into the matching property in the document itself. When WSS invokes the demotion function of the parser, it again passes the parser the document, and a property bag object. In this case, the property bag object contains the properties that WSS expects the parser to demote into the document. The parser demotes the specified properties, and WSS saves the updated document back to the document library.
The figure below illustrates the document property demotion process. To update two document properties, WSS invokes the parser, passing it the document to be updated, and a property bag object containing the two document properties. The parser reads the property values from the property bag, and updates the properties in the document. When the parser finishes updating the document, it passes a parameter back to WSS that indicates that it has changed the document. WSS then saves the updated document to the document library.
Once the document parser writes the document properties to the property bag object, WSS promotes the document properties that match columns on the document library. To do this, WSS compares the document property name with the internal names of the columns in the document’s content type, or on the document library itself if the document doesn’t have a content type. When WSS finds a column whose internal name matches the document property, it promotes the document property value into that column for the document.
However, WSS also enables you to explicitly map a document property to a specific column. You specify this mapping information in the column’s field definition schema.
Mapping document properties to columns in the column’s field definition enables you to map document properties to columns that may or may not be named the same. For example, you can map the document property ‘Author’ to the ‘Creator’ column of a content type or document library.
To specify this mapping, add a ParserRef element to the field definition schema, as shown in the example below:
<Field Type=”Text” Name=”Creator” … >
<ParserRef Name=”Author” Assembly=”myParser.Assembly”>
The following elements are used to define a document property mapping:
Optional. Represents a list of document parser references for this column.
Optional. Represents a single document parser reference. This element contains the following attributes:
· Name Required String. The name of the document property to be mapped to this column.
· Assembly Required String. The name of the document parser used.
A column’s field definition might contain multiple parser references, each specifying a different document parser.
In addition, if you are working with files in multiple languages, you can use parser references to manage the mapping of document properties to the appropriate columns in multiple languages, rather than have to build that functionality into the parser itself. The parser can generate a single document property, while you use multiple parser references to make sure the property is mapped to the correct column for that language. For example, suppose a parser extracts a document property named ‘Singer’. You could then map that property to a column named ‘Cantador’, as in the example below:
<Field Type=”Text” Name=”Cantador” … >
<ParserRef Name=”Singer” CLSID=”MyParser”>
<ParserRef Name=”Artist” Assembly=”MP3Parser.Assembly”>
To edit a column’s field definition schema programmatically, use the SPField.SchemaXML object. There is no equivalent user interface for specifying a parsers for a column.
In the next entry, we'll discuss how WSS processes document that contain their content type definition.
Now that I’ve talked about the built-in XML parser, and how you can use it to promote and demote document properties for XML files, you might be thinking: what about custom files types that aren’t XML? What if I’ve got proprietary binary file types from which I want to promote and demote properties to the SharePoint list?
We’ve got you covered there as well.
For the next four entries, I’m going to go over in detail how to construct and register a custom parser that enables you to promote and demote properties between your custom file types and Windows SharePoint Services.
This information will get rolled into the next update of the WSS SDK, so consider this a preview if case you want to work with the parser framework right now.
Managing the metadata associated with your document is one of the most powerful advantages of storing your enterprise content in WSS. However, keeping that information in synch between the document library level and in the document itself is a challenge. WSS provides the document parser infrastructure, which enables you to create and install custom document parsers that can parse your custom file types and update a document for changes made at the document library level, and vice versa. Using a document parser for your custom file types helps ensure that your document metadata is always up-to-date and synchronized between the document library and the document itself.
A document parser is a custom COM assembly that, by implementing the WSS document parser interface, does the following when invoked by WSS:
· Extracts document property values from a document of a certain file type, and pass those property values to WSS for promotion to the document library property columns.
· Receives document properties, and demote those property values into the document itself.
This enables users to edit document properties in the document itself, and have the property values on the document library automatically updated to reflect their changes. Likewise, users can update property values at the document library level, and have those changes automatically written back into the document.
I’ll talk about how WSS invokes document parsers, and how those parsers promote and demote document metadata, in my next entry.
For WSS to use a custom document parser, the document parser must meet the following conditions:
· The document parser must be a COM assembly that implements the document parser interface.
I’ll go over the details of the IParser interface in a later entry.
· The document parser assembly must be installed and registered on each front-end Web server in the WSS installation.
· You must add an entry for the document parser in DOCPARSE.XML, the file that contains the list of document parsers and the file types with which each is associated.
And I’ll give you the specifics of the document parser definition schema in a later entry as well. All in good time.
WSS selects the document parser to invoke based on the file type of the document to be parsed. A given document parser can be associated with multiple file types, but you can associate a given file type with only one parser.
To specify the file type or types that a custom document parser can parse, you add a node to the Docparse.XML file. Each node in this document identifies a document parser assembly, and specifies the file type for which it is to be used. You can specify a file type by either file extension or program ID.
If you specify multiple document parsers for the same file type, WSS invokes the first document parser in the list associated with the file type.
WSS includes built-in document parsers for the following file types:
· OLE: includes DOC, XLS, PPT, MSG, and PUB file formats
· Office 2007 XML formats: includes DOCX, DOCM, PPTX, PPTM, XLSX and XLSM file formats
· HTM: includes HTM, HTML, MHT, MHTM, and ASPX file formats
You cannot create a custom document parser for these file types. With the XML parser, you can use content types to specify which document properties you want to map to which content type columns, and where the document properties reside in your XML documents.
To guarantee that WSS is able to invoke a given parser whenever necessary, you must install each parser assembly on each front end Web server in your WSS installation. Because of this, you can specify only one parser for a given file type across a WSS installation.
The document parser framework does not include the ability to package and deploy a custom document parser as part of a SharePoint feature.
In my next post, I’ll discuss how the document parser actually parses documents and interacts with WSS.
Today I'd like to talk about how to store and manage XML forms (InfoPath and otherwise) in WSS V3.
In general, forms have three special areas of functionality:
· Property Promotion and Demotion This refers to promoting and demoting document data to and from columns in a SharePoint library.
· Link Management This refers to how WSS keeps the link to the form template in the XML file up to date, if any site, sub-site, or library is renamed.
· Merge Forms Sends multiple XML files that have the same schema to a client application to be merged.
In Windows SharePoint Services Version 2, this special functionality was part of the Form Library site template. Now, in WSS V3, it's encapsulated in the Form content type instead. Instead of creating a new form library, you can create a new content type, inheriting from the Form content type. Your new form becomes the template for the content type.
This new approach offers several major advantages over the form libraries:
· Central management of forms Because you're creating a new content type, you gain all the advantages content types offer in terms of centralized management. You can control the form and its metadata from a single location, regardless of how many libraries to which you've added the content type. You can also specify enterprise content management features for your content type, such as property promotion and demotion, workflow, or (if you have Office Server 2007) information policy. And you don't have to republish the form if you make changes to the content type for which it's the form template.
· Add multiple forms to the same library Again, because each type of form is a separate content type, you can add multiple form type to the same library.
· Add forms to the same libraries as documents You can form content types to the same libraries that contain document content types. In a very real sense, there's no distinction between form and document libraries in WSS V3; there're just libraries. Any special functionality is encapsulated in the content types themselves.
(However, the form library site template has been retained in WSS V3 for purposes of backward compatibility with XML-based forms editors that do not support publishing forms to content types, such as InfoPath 2003. But if you are using an XML-based forms editor that supports content types, such as InfoPath 2007, we strongly recommend you use the Form content type instead of the form library template. And with all the obvious advantages, why wouldn't you?)
So let's talk specifically about InfoPath 2007 forms for WSS V3. InfoPath 2007 includes the ability to publish forms directly into new or existing content types on a WSS installation. This includes setting up property promotion and demotion: you're able to choose which data fields on your form you want to map to which SharePoint columns. You get link management and merge forms functionality for free, because the content type to which you publish your form is a child of the Form content type.
When you publish an InfoPath 2007 form, you have the option to publish it to a SharePoint site. When you do, you have the further option of publishing the form into a new or existing content type.
(You still have the option of publishing your form to a document library, just as you did with InfoPath 2003. In InfoPath 2007, though, you can update both document libraries as well as form libraries, as long as that library’s default content type is a child of the Form content type. If the library doesn’t have multiple content types, then the single library content type must be a child of Form in order for you to update it. In addition, document property demotion now works in document libraries.)
When you choose to create a new content type, you select the content type from which you want your new content type to inherit. By default, the new content type is a direct child of the Form content type, but you can specify any descendant of Form to use as your new content type's parent.
Next, you can choose which data fields you want to promote and demote to and from SharePoint site columns. To do this, you specify the columns to which you want to map the data fields in your form. You can map the data fields to existing site columns, or create new columns.
If you map a data field to an existing site column, InfoPath adds a <FieldRef> element to the content type for that site column. The <FieldRef> element contains a set of attributes you can override for the site column, as it applies to items assigned this content type. These attributes include those that specify where in an XML form the data field information is located. Because you’re mapping a data field (i.e., the data in your XML) to the column, InfoPath uses the Node attribute to specify an XPath expression for where the data resides in the form.
For more information on how the built-in XML parser uses the Node attribute for data promotion and demotion, see this earlier post.
If you create a new site column, InfoPath adds a <Field> element that represents that site column to the site, and then adds a <FieldRef> element to the content type for that site column.
For more information on how <Field> and <FieldRef> elements differ, and how <FieldRef> elements work in a content type, see my earlier post here.
If you choose to update the existing content type itself, you can stick with the existing columns for that content type, or add new ones. You can also remove columns for data fields you've deleted or don't want promoted or demoted anymore. If you do delete a column, InfoPath removes the <FieldRef> element from the content type definition, but leaves the site column intact (that is, the <Field> element for that site column remains in the site definition). You cannot remove any columns the content type inherits from its parent content type.
Once you've published your form to a content type, you can use WSS to further customize the content type, such as adding columns, workflows, and policies. Also, remember that this is a site content type; in order to use the content type in a library, you have to add that site content type to the library through WSS.
Here are a few other things to keeping mind about InfoPath 2007 forms and form content types:
InfoPath forms themselves do not contain the content type ID of the content type they’ve been assigned. Instead, WSS uses the form template to determine the content type of the form. The form template URL is included in the <?xml> processing instruction of the form, and points to the .xsn on which the form is based.
The built-in XML parser included in WSS first attempts to find the content type ID in the XML file, and, failing that, attempts to determine the content type based on the form template. So, for InfoPath forms, the built-in XML parser always fails its first check, and has to use the form template to determine content type.
For more information on how the XML parser determines content type, see this earlier post.
The program ID for InfoPath 2007 forms is included in the forms in the progId2 attribute of the <?mso-infopathdocument> processing instruction. In content type definitions, InfoPath maps this to the ProgID column using the PrimaryPITarget and PrimaryPIAttribute attributes.
So, for content types for InfoPath 2007 forms, the attributes for the ProgID <FieldRef> element are set to the following values:
PITarget = mso-infopathdocument
PIAttribute = InfoPath.Document
PrimaryPITarget = mso-infopathdocument
PrimaryPIAttribute = InfoPath.Document.2
Because WSS looks at the PrimaryPITarget and PrimaryPIAttribute pair first, the InfoPath.Document.2 attribute value is promoted to the ProgID column. The ProgID column determines which application WSS calls to open a selected file.
Written while listening to The Twilight Singers : Twilight as Played by the Twilight Singers
This is the final post in a five-part series on how to use the built-in XML parser in WSS V3 to promote and demote document properties in your XML files, including InfoPath 2007 forms.
Read part four here.
You can include namespace prefixes in the XPath expressions you use to specify the location of a document property. To use a namespace prefix, you must first define it in a custom XMLDocument element included in the content type definition.
The namespace for this XML node is:
The structure of the custom XMLDocument node is as follows:
Following is a list of the elements in the schema, and their definitions:
Namespaces The root element. Represents a collection of namespaces for use in XPath expressions.
For each namespace you specify, you must include the following values:
· prefix The prefix to use when designating this namespace.
· value The actual namespace for which you are specifying a prefix.
This schema describes optional XML you can include in a content type as custom information. For more information on including custom information in a content type, see this topic in the WSS V3 Beta 2 SDK.
This is the fourth in a five-part series on how to use the built-in XML parser in WSS V3 to promote and demote document properties in your XML files, including InfoPath 2007 forms.
There are two document properties the built-in XML parser examines to determine the content type to assign to an XML document when a user first uploads it to a given document library. The parser must determine which of the content types associated with the document library to assign the document before the parser can promote or demote document properties.
For a detailed examination of the process the parser performs to match a document’s content type with a content type associated with the document library, see part three.
The parser first looks for a processing instruction that specifies the document’s content type by content type ID. The location of this processing instruction is included in the definition for the content type ID column template. The processing instruction is named MicrosoftWindowsSharePointServices and contains an attribute named ContentTypeID that represent the ID of the document's content type.
Name="Content Type ID"
By default, all library list templates include a column that represents the content type ID.
Add this processing instruction to your XML document. Set the ContentTypeID attribute to the ID of the document’s content type.
<? MicrosoftWindowsSharePointServices ContentTypeID=”0x010101003D7907A1908011d082BD08005AA74F5E00A557E10DA69DBF4C8BE1E21071B08163”/>
The parser will fail to determine the content type in the following situations:
· The MicrosoftWindowsSharePointServices processing instruction isn’t present in the document
· The processing instruction does not specify a content type
· The specified content type is not associated with the document library
· No parent or child of the specified content type is associated with the document library
If the parser cannot identify the content type by content type ID, it performs a second check, detailed below.
Note The parser looks for the content type ID in whatever document location you specify in the field definition for the Content Type ID column on the document library. You can map the Content Type ID column to any processing instruction or XPath expression you choose. However, we recommend you adhere to the default mapping include in the content type ID column template definition. This minimizes the chance of having content types that specify a different location for this document property than the document library with which they are associated. Such a situation would lead to the XML parser looking in the wrong document location for the content type ID.
If the parser fails to determine a suitable content type for the document based on content type ID, the parser then looks for a processing instruction that contains the URL of the document template on which the document is based. The processing instruction is named mso-infoPathSolution that contains an attribute named href that represents the URL of the document template.
This column is included in the Form content type, and is added to a library anytime that content type is added to the library.
So, rather than include a content type ID, you can add this processing instruction to your XML document. Set the href attribute to the URI of the document template on which the document is based.
If the parser finds this processing instruction, it then examines the content types associated with the document library to determine if a content type has the same document template. If so, the parser assigns that content type to the document. If more than one content type associated with the document library has the same matching document template, the parser simply assigns the first content type it finds that matches.
Note The parser looks for the document template URL in whatever document location you specify in the field definition for the Document Template column on the document library. You can map the Document Template column to any processing instruction or XPath expression you choose. However, we recommend you adhere to the default mapping included in the document template column template definition. This minimizes the chance of having content types that specify a different location for this document property than the document library with which they are associated. Such a situation would lead to the XML parser looking in the wrong document location for the document template.
In the final installment of this series, I'll show how you can include namespace prefixes in the XPath expressions you use to specify the location of a document property.
This is the third in a five-part series on how to use the built-in XML parser in WSS V3 to promote and demote document properties in your XML files, including InfoPath 2007 forms.
In order for the built-in XML parser to determine the document’s content type, and thereby access the content type definition, the document itself must contain the content type as a document property. The parser looks for a special processing instruction in your XML documents to identify the document’s content type. You can include processing instructions that identify the document’s content type by content type, and/or document template.
When a user uploads an XML document to a document library, WSS V3 invokes the built-in XML parser. Before the parser can promote document properties, it must first determine the content type, if any, of the document itself.
The parser first looks at the Field element in the document library schema that represents the content type ID column on the document library. The parser examines the Field element for the location in the document where the content type ID should be stored. The parser then determines if the content type ID is indeed stored in the document at this location. If no content type ID is specified at that location, the parser assigns the default content type to the document. The parser then uploads the document and promotes any document properties accordingly.
If the document does contain a content type ID at the specified location, the parser determines if the content type with that ID is also associated with the document library. If it is, the parser uploads the document and promotes any document properties accordingly.
If the parser doesn't find an exact match, it examines ID's of the content types on the document library to determine if one or more are children of the document content type. If so, the parser assigns the closest child content type to the document. The parser then uploads the document and promotes any document properties accordingly.
Note The parser examines the list for content types that are children of the document content type because, in most cases, the document will have been assigned a site content type. In such cases, the matching list content type will be a child of the site content type.
If the parser finds no content type match at all, it then looks at the Field element in the document library schema that represents the document template column on the document library, if such a column is present. If the document library does contain a document template column, the parser examines the Field element for the location in the document where the document template should be stored. The parser then determines if the document template is indeed stored in the document at this location.
If the document does contain a document template, the parser compares the template with the document templates specified in each content type on the document library. If the parser finds a content type with the same document template as the document, the parser assigns that content type to the document. If there are multiple content types with the same document template as the document, the parser simply assigns the first such content type it finds. The parser then uploads the document and promotes any document properties accordingly.
Finally, if the parser cannot find a content type match, the parser assigns the default content type to the document. The parser then uploads the document and promotes any document properties accordingly.
The flow chart below illustrates the checks the parser performs to determine a document's content type.
For more information on how the parser promotes and demotes specific document properties, part two of this series.
One aspect of the parser's operation that bears emphasis is the fact the parser looks to the document library's content type and document template columns to determine where in the XML file to look for those matching document properties. Therefore, in order for promotion and demotion to work correctly, all content types on a given document library must contain content type and document template column definitions that specify the same location for those document properties as the document library columns. Otherwise, the parser will be looking in the wrong location within the document for those properties.
In the next entry, we’ll take a look at how you actually specify the content type of the document, inside that document itself.