As we get comments and questions from you about the add-in, and have opportunities to discuss the experience and underlying technology face-to-face (as at the HighWire Publisher's Meeting last week), it is striking how frequently the topic of structured content comes up.

Looking Back and Looking Forward

Although legacy content (book scanning) gets more of the awareness, visibility, and coverage, it is just as important, and perhaps more so, to focus on how new content is being created.  As the consumption of content shifts from print to digital, and search becomes one of the key ways we come across new content (journals and conferences being the other traditional ways), we need to evolve the process by which new content is created in order to fully exploit the benefits of the new digital medium.

The way articles are commonly authored today still treats print as the end point: presentation gets more attention than semantics.  In many workflows, the semantics are added after the article is approved, somewhere along the publishing pipeline, by people other than the article's authors.  We need to enable authors to add semantics as part of the authoring process, and to have that content preserved through the publishing workflow.  The best technology available today for preserving semantic content is to express the article in XML.  Archiving articles only in presentation-oriented formats (HTML or PDF) results in a loss of information.  This loss of information is not only detrimental to effective search today, but will also be detrimental to other types of semantic analysis in the future.
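
To make this concrete, here is a rough sketch of semantic markup, using element names from the NLM journal archiving DTD; the content is hypothetical and the fragment is trimmed, so treat it as a sketch rather than a complete, valid document:

    <!-- The role of each piece is explicit, so a search engine can tell
         "Garcia the author" apart from a mention of "garcia" in the body text. -->
    <contrib contrib-type="author">
      <name>
        <surname>Garcia</surname>
        <given-names>Ana</given-names>
      </name>
    </contrib>
    <abstract>
      <p>We examine how ...</p>
    </abstract>

An HTML or PDF rendering keeps only the visual result of markup like this (an italicized byline, an indented paragraph); the roles have to be inferred back from the formatting, usually imperfectly.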

Lossy Workflows

In some print-focused workflows, even if semantic elements were present in the original file created by the author, the semantics are lost in the digital version as a result of the process leading to print.  At the end of such a workflow, a PDF file is generated, reflecting the print layout.  Any semantic information in the original article is lost in the resulting PDF.  The PDF file, in turn, is used to generate an XML file, which becomes the basis for the digital content (and usually has to be sent out for tagging).  Workflows of this kind not only lose semantic information; they are also an inefficient way to create digital content.  As journals move to being digital first (or digital only), editors will need to pay close attention to how their workflows are structured and how data is preserved and converted.
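
A contrived but representative example of what the PDF round trip destroys (NLM-style element names, hypothetical content):

    <!-- What the author's original file could have carried: -->
    <name>
      <surname>van der Berg</surname>
      <given-names>Maria</given-names>
    </name>

    <!-- What comes back out of the PDF is a flat string:

             Maria van der Berg

         Is the surname "Berg", "der Berg", or "van der Berg"?  The tagging
         vendor has to guess, and author searches inherit the guess. -->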

Presentation vs Semantics

Ideally, if we do a good job of capturing semantics during authoring, we can ignore the final presentation throughout the publishing workflow.  Layout and other presentation elements (margins, font family, color, size, etc.) can be applied to the archive version of the document at viewing time (for example, during the generation of the HTML files).  Of course, we do not want authors dealing with raw XML editing.  Having a nice presentation within the word processor helps with authoring, but it does not need to be the final presentation; it can simply be the presentation that works best for the author.
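
As an illustration of applying presentation at viewing time, here is a minimal XSLT sketch; the match patterns assume NLM-style element names, and a real stylesheet for HTML generation would of course be much larger:

    <?xml version="1.0"?>
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

      <!-- The article title becomes the page heading; font, size, and
           color live in CSS, never in the archived XML. -->
      <xsl:template match="article-title">
        <h1 class="article-title"><xsl:apply-templates/></h1>
      </xsl:template>

      <!-- The abstract becomes a styled block. -->
      <xsl:template match="abstract">
        <div class="abstract"><xsl:apply-templates/></div>
      </xsl:template>

    </xsl:stylesheet>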

Currently, some Word article templates rely on styles to identify semantics.  This was the best that could be done with the capabilities of previous versions of Word, but it is fragile because it mixes presentation and semantics.  To make matters worse, styles easily "leak" through copy and paste, or through editing, invalidating the semantics.  Word 2007's use of XML as its native format, its support for custom XML elements, and the extensibility of Word's user interface and file packaging enable a more robust way of entering semantic elements during authoring, preserving metadata, and enabling conversion to other formats.
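
Simplified WordprocessingML fragments illustrate the difference; namespace declarations are omitted, and the w:uri value is a made-up placeholder:

    <!-- Style-based tagging: "Abstract" is just a named bundle of
         formatting.  Reapply a style, or paste over the paragraph, and
         the "semantics" silently disappear. -->
    <w:p>
      <w:pPr><w:pStyle w:val="Abstract"/></w:pPr>
      <w:r><w:t>Coral bleaching has increased in frequency ...</w:t></w:r>
    </w:p>

    <!-- Custom XML markup in Word 2007: the element is structure, not
         presentation, so it survives formatting changes. -->
    <w:customXml w:uri="http://example.org/article" w:element="abstract">
      <w:p>
        <w:r><w:t>Coral bleaching has increased in frequency ...</w:t></w:r>
      </w:p>
    </w:customXml>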

Capturing Semantics and Authors' Insights

Authors will be the largest class of users of the add-in, so we focus a great deal on the experience presented to this audience.  Authors will likely have no idea of the format used to back their articles (whether OpenXML or NLM), nor should they care.  Likewise, the richness and complexity of the metadata expressible in the NLM format does not need to be exposed to them in raw form (though it needs to be accessible to journal and archival staff; I will cover this in a future posting).

What is the benefit to authors of capturing semantics and metadata?  As semantic search evolves, articles carrying more and better semantic data should rank as more relevant in search results than articles without this information.  Thus, as content moves to being consumed primarily in digital form, articles with better semantics stand a better chance of being found, read, and cited.

In the spirit of the Dublin Core (or the core of the Core), we are focusing on enabling journals to capture a small set of data from authors (sketched in NLM terms after the list):

  • Sections (title, abstract, etc.)
  • Author information
  • Keywords and subjects
  • Author notes
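
In NLM terms, a trimmed, hypothetical sketch of front matter covering these four items (not a complete or valid article-meta):

    <article-meta>
      <article-categories>
        <subj-group><subject>Marine Biology</subject></subj-group>
      </article-categories>
      <title-group>
        <article-title>Thermal Tolerance in Reef Corals</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <name><surname>Garcia</surname><given-names>Ana</given-names></name>
          <aff>Example University</aff>
        </contrib>
      </contrib-group>
      <author-notes>
        <corresp>Correspondence: a.garcia@example.edu</corresp>
      </author-notes>
      <abstract><p>...</p></abstract>
      <kwd-group>
        <kwd>coral</kwd>
        <kwd>bleaching</kwd>
      </kwd-group>
    </article-meta>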

While it would be interesting to capture additional information within the content during authoring, it is important not to overburden authors.  If we can reliably capture even this small set of data, while reducing entry errors, we will have a good baseline for metadata in articles.  Over time, as authors become comfortable with the concept and see its benefits, the baseline can be raised (but, again, the user interaction needs to be simple and, as much as possible, unobtrusive).

Additional Reading

Peter Murray-Rust has a couple of threads on related topics: one on semantics and chemistry, and another on structured content and PDF.