Ontology Add-in for Word 2007 - Technology Preview
We just posted a Technology Preview build of an add-in that lets authors add semantic knowledge to documents by associating words in the document with ontology terms. This add-in can be installed from CodePlex, and the source is available here under the Ms-PL license.
Add-in Basics
The add-in, developed in collaboration with the University of California San Diego and Science Commons, serves as a solution accelerator for those working in the ontology field. The add-in works in two ways:
- background scanning
- direct tagging
By default, the add-in scans the document in the background and uses SmartTags to suggest the terms it recognizes. Through the SmartTag menu, authors can associate recognized words with the appropriate ontology terms.

Authors can also tag words directly, by highlighting the words and selecting the ontology term that they want to associate with them.
A large number of ontologies in the OBO format are available online, and the configuration dialog lets authors download additional ontologies to the computer for use by the add-in.
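To make the idea of scanning against a downloaded ontology concrete, here is a minimal Python sketch. It is not the add-in's implementation: the function names `parse_obo_terms` and `find_known_terms` are hypothetical, and the parser assumes only the basic OBO layout of `[Term]` stanzas made of `tag: value` lines.

```python
import re


def parse_obo_terms(text):
    """Collect the [Term] stanzas of an OBO file into {term_id: {tag: value}}."""
    terms = {}
    current = None  # dict for the stanza being read, or None outside a [Term]
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("!"):
            continue  # skip blank lines and comment lines
        if line.startswith("[") and line.endswith("]"):
            # A new stanza starts; only [Term] stanzas are kept in this sketch.
            current = {} if line == "[Term]" else None
            continue
        if current is None:
            continue  # header lines or a stanza type we ignore
        tag, sep, value = line.partition(":")
        if not sep:
            continue
        tag, value = tag.strip(), value.strip()
        if tag == "id":
            terms[value] = current
        # Keep only the first value of a repeated tag (a simplification).
        current.setdefault(tag, value)
    return terms


def find_known_terms(text, terms):
    """Return (matched_text, term_id) pairs for term names found in text."""
    by_name = {t["name"].lower(): tid for tid, t in terms.items() if "name" in t}
    if not by_name:
        return []
    # Try longer names first so multi-word terms win over their substrings.
    names = sorted(by_name, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:%s)\b" % "|".join(re.escape(n) for n in names), re.IGNORECASE
    )
    return [(m.group(), by_name[m.group().lower()]) for m in pattern.finditer(text)]
```

A scanner along these lines would feed its matches to the suggestion user interface; the exact-name matching shown here is the simplest possible strategy and ignores synonyms and inflected forms.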
Target Audience
Looking at the developer stack from higher to lower levels of abstraction, the add-in will be useful in at least three key areas:
- Development of new ontologies
- Investigation of new author interaction paradigms
- Integration into publishing and semantic workflows
For those developing new ontologies, the add-in provides a very easy way to test those ontologies with their target audience. In many scientific disciplines, Microsoft Word is a very popular tool for authoring papers and articles, so authors are already familiar with its usage and features. The add-in builds on this familiarity to expose new functionality, and additional ontologies can be downloaded through a REST interface.
For researchers who are focused on new ways of analyzing content and detecting terms automatically, or on extending the user interaction with authors, access to the source code provides a great foundation to build on, without having to start from scratch. Also, community members can add incremental value to the add-in and share it back with others in the community. For this purpose, CodePlex provides a good forum to host discussions, report bugs, and publish documents related to the project.
Developers who work in the publishing industry, or at libraries and repositories, can customize or extend the add-in to present a user interface specific to their organization, or to add information to the XML content when terms are recognized and tagged by the add-in. Enhancing the information in the XML tag that captures the semantic information would also be useful to those doing semantic analysis, search, or storing information in databases.
Tagging
The add-in relies on custom XML tags to associate the semantic information with the matched words. The semantic information is stored as part of the document content. Utilities and applications that read or process docx files can retrieve and use this information (or transform it to other formats).
<w:customXml w:uri="http://biolit.ucsd.edu/biolitschema" w:element="biolit-term">
  <w:customXmlPr>
    <w:attr w:name="id" w:val="GO:0031386" />
    <w:attr w:name="type" w:val="Biological process" />
    <w:attr w:name="status" w:val="true" />
    <w:attr w:name="OntName" w:val="Biological process" />
    <w:attr w:name="url" w:val="http://purl.org/obo/owl/GO#GO_0031386" />
  </w:customXmlPr>
  <w:smartTag w:uri="BioLitTags" w:element="tag1">
    <w:r>
      <w:t>protein tag</w:t>
    </w:r>
  </w:smartTag>
</w:customXml>
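Because the tag is ordinary WordprocessingML, a utility can recover the tagged terms with a standard XML parser. The following Python sketch shows one way to do so. It assumes the element and attribute names shown in the sample above and that the main document part is stored at `word/document.xml` inside the docx package; the function names are hypothetical and not part of the add-in.

```python
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML main namespace, normally bound to the "w" prefix.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
# Schema URI taken from the sample tag above.
BIOLIT_URI = "http://biolit.ucsd.edu/biolitschema"


def q(name):
    """Qualify a local name with the WordprocessingML namespace."""
    return "{%s}%s" % (W_NS, name)


def extract_terms(document_xml):
    """Return one dict per biolit-term tag: its text plus its w:attr values."""
    root = ET.fromstring(document_xml)
    terms = []
    for node in root.iter(q("customXml")):
        if node.get(q("uri")) != BIOLIT_URI:
            continue
        if node.get(q("element")) != "biolit-term":
            continue
        record = {}
        props = node.find(q("customXmlPr"))
        if props is not None:
            for attr in props.findall(q("attr")):
                record[attr.get(q("name"))] = attr.get(q("val"))
        # The tagged words are the concatenated text runs inside the tag.
        record["text"] = "".join(t.text or "" for t in node.iter(q("t")))
        terms.append(record)
    return terms


def extract_terms_from_docx(path):
    """Read the main document part of a docx package and extract its tags."""
    with zipfile.ZipFile(path) as package:
        return extract_terms(package.read("word/document.xml"))
```

Calling `extract_terms_from_docx("paper.docx")` on a tagged document (the file name is only an example) would return a list of dictionaries with keys such as `id`, `url`, and `text`, which can then be written to a database or transformed to another format.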
Additional Functionality
The add-in also enables authors to search for terms in ontologies, look up their definitions, and browse the ontologies to understand their organization and structure and to examine their terms. The add-in also provides a way to highlight tagged terms, which makes it simple to review the document and identify all tagged words.

A final useful piece of functionality is the recognition of protein ID patterns from the National Center for Biotechnology Information (NCBI) and Protein Data Bank (PDB) databases.
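The patterns the add-in actually uses are not described here, so the following Python sketch only illustrates the general approach. It assumes two well-known identifier shapes: classic four-character PDB IDs (a digit followed by three alphanumeric characters) and NCBI RefSeq protein accessions such as `NP_000537.3`. Other NCBI accession formats are not covered, and the function name `find_protein_ids` is hypothetical.

```python
import re

# Classic PDB ID: one digit 1-9 followed by three letters or digits.
PDB_ID = re.compile(r"\b[1-9][A-Za-z0-9]{3}\b")
# RefSeq protein accession: AP_, NP_, XP_, YP_ or WP_ prefix, digits,
# and an optional ".version" suffix.
REFSEQ_PROTEIN = re.compile(r"\b[ANXYW]P_\d{6,9}(?:\.\d+)?\b")


def find_protein_ids(text):
    """Return (kind, identifier) pairs for protein IDs recognized in text."""
    hits = []
    for match in REFSEQ_PROTEIN.finditer(text):
        hits.append(("refseq", match.group()))
    for match in PDB_ID.finditer(text):
        token = match.group()
        # Heuristic (an assumption of this sketch): skip all-digit tokens
        # such as years, which would otherwise look like PDB IDs.
        if any(ch.isalpha() for ch in token):
            hits.append(("pdb", token.upper()))
    return hits
```

Pattern matching of this kind produces false positives, since any four-character token that starts with a digit looks like a PDB ID, so a real implementation would likely confirm candidates against the database before tagging them.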
