Feedback on your usage model

Participation in the Beta program so far has been fantastic and we have been able to incorporate most of the feedback and requests that you have sent in so far.  It is evident that this community is very hands-on and enthusiastic about the authoring and archiving process.

We have been able to engage in a good dialog with several community members, with folks submitting their sample documents for testing and scenarios as input.  While a lot of people have downloaded the add-in, not everyone has contacted us with comments and feedback yet, so I wanted to open up the dialog and solicit your input. 

As expected, the majority of the early adopters are part of the staff at journals, repositories, libraries, and also companies that support the publishing workflow.  There are a few enthusiasts/early adopters, as well as folks interested in the writing process and in capturing semantics, who have also tried out the add-in and sent their feedback.  Many thanks to all who have contacted us!

To help kickstart the broader dialog, here are some questions as to how you are using, or planning to use, the add-in as part of your workflow:

  • Do you start using the add-in by importing an xml file, or by pasting in content from another document?
  • Do you use the add-in for editing content, metadata, or both?
  • Do you introduce new sections into documents?  Do you tend to use custom sections?
  • Have you created templates, with sections that you use often, for re-use?
  • Have you used the keyword or subject area panel (on the right side of the window)?
  • Are there elements in the NLM format which are not currently supported in the add-in which are essential for your workflow?
  • Do you plan to access the content or metadata in the file directly, without using Word (for example using your own internal tools)?

If there is anything else you want to comment on in relation to the add-in, feel free to send it in.  You can post your answers as comments on the blog or send them over email.  This is a great opportunity to engage in the development of the add-in and help shape the experience for authoring scientific and technical articles.

And, once more, many thanks for your participation and input.

Posted by pablofe | 0 Comments

Beta of Microsoft’s Article Authoring Add-in Now Available Broadly for Download

Enabling Journals to Better Connect with Scientific Authors in a Digital World

 

 

REDMOND, Wash. — May 30, 2008 — At the annual meeting of the Society for Scholarly Publishing in Boston, Microsoft announced the wide availability of the Beta 1 release of the Article Authoring Add-in for Microsoft Word 2007. In addition to enabling Word users to open and save documents using the National Library of Medicine’s XML Journal Publishing format, used for the authoring of scientific articles, the Beta 1 release adds support for the NCBI Book format, used for authoring book chapters for digital books.

 

Enabling Journals to Better Connect with Authors in a Digital World

A key value of the Article Authoring add-in is in enabling editors at scientific and technical journals to create article templates, tailored for their individual journals’ requirements.  These templates will assist authors in writing articles with greater consistency in relation to the structure of the articles, better reflecting the content requirements of the journals, and in expressing semantic information which is key for the search and consumption of articles in digital form.

 

“The Add-In is a very positive development that will help scholars to write and tag their articles in the industry-standard NLM XML DTD, and will help publishers to process these articles in their editorial and production departments. We are pleased to be working with Microsoft on testing and refining this important tool that will benefit scholars and scholarly publishers alike”, said Ahmed Hindawi, CEO of Hindawi Publishing Corporation.

 

Preserving Information for Archiving and Search

The Article Authoring add-in enables authors to express a greater variety of semantic information, and metadata, as part of writing articles.  This semantic information, captured in the XML format and preserved based on the extensibility in the Open XML standard, will prove valuable in improving the results from search queries and for the long term archival of scientific information.

 

In addition to preserving information that is native to Microsoft Word, the Beta 1 release of the Article Authoring add-in also preserves Math information from controls, such as Design Science’s MathType, when saving Word documents to the NLM XML format.  Paul Topping, President and CEO of Design Science, Inc., stated that "We were happy to work with Microsoft to add support for Equation Editor and MathType equations to the Article Authoring add-in. Since at least 85% of the articles containing math submitted to scientific journals have equations in those formats, this support is critical."

 

The Open XML standard, with its capabilities to support custom-schemas, enables the Word add-in to support the entire set of rich information encoded by the NLM format. The add-in also provides easy access to the metadata in the NLM format, both by journal editors and by authors, directly from within the Word user interface.  The broad availability of the Beta 1 release provides a way for the different communities, such as authors, journals, digital archives, and software vendors, to evaluate the technology and provide feedback, guiding further development of the add-in towards its initial release in the second half of 2008.

 

Information on how to download the Beta 1 release of the Article Authoring Add-in for Microsoft Office Word 2007 can be found at http://www.microsoft.com/downloads/details.aspx?FamilyID=09c55527-0759-4d6d-ae02-51e90131997e&displaylang=en.

Posted by pablofe | 0 Comments

See You at SSP Next Week?

I, along with folks from the Word team and Technical Computing Initiative/Microsoft Research, will be in Boston next week, participating in a session on Wednesday morning (Seminar 1).

Also on Friday, I will be hosting an informal discussion table during the Networking Lunch.

If you are interested in a demo or in having discussions during the event, drop a comment here, or for a more general get-together, join this Facebook event.

Hope to see you there.

Posted by pablofe | 0 Comments

May Tidbits

A few interesting news items:

  •  Aries Systems now supports Word 2007 docx files
  •  The Times Reader will start a Beta of their Mac version

Last, I wanted to highlight that we will be attending (and presenting) at the upcoming Society for Scholarly Publishing meeting in Boston in a couple of weeks.  If you want to meet up, join this event on Facebook, or drop a comment on this blog.

Posted by pablofe | 0 Comments

The Power of Structured Content

As we get comments and questions from you about the add-in, or have opportunities to discuss the experience and underlying technology face-to-face (as at the HighWire Publisher's Meeting last week), it is interesting how frequently the topic of structured content comes up.

Looking Back and Looking Forward

Although there is more awareness, visibility, and articles on the topic of legacy content (book scanning), it is important, and perhaps even more so, to focus on how new content is being created.  As the consumption of content makes the transition from print to digital, and search becomes one of the key ways we come across new content (journals and conferences being the other traditional ways), it is important that we evolve the process by which new content is created, in order to be able to fully exploit the benefits of the new digital medium.

The way articles are commonly authored today is still largely focused on print as the end point, in that there is still a larger focus on presentation over semantics.  The semantics in many workflows are added after the article is approved, somewhere along the publishing pipeline, by people other than the article authors. We need to enable authors to add semantics as part of the authoring process, and have that content preserved through the publishing workflow.  The best technology available to preserve semantic content today is to express the article in XML.  Archiving articles only in presentation formats (either HTML or PDF) results in a loss of information.  This loss of information is not only detrimental to effective search today, but will also be detrimental to other types of semantic analysis in the future.

Lossy Workflows

In some print-focused workflows, even if semantic elements were present in the original file created by the author, the semantics are lost in the digital version, as a result of the process leading to print.  At the end of this type of workflow, a PDF file is generated, reflecting the print layout.  Any semantic information in the original article is lost in the resulting PDF.  The PDF file, in turn, is used to generate an XML file, which is the basis for the digital content (and usually has to be sent out for tagging).  These types of workflows not only result in a loss of semantic information, but are inefficient from the point of view of creating digital content.  As journals move to be digital first (or digital only), editors will need to pay close attention to how their workflows are structured and how data is preserved/converted.

Presentation vs Semantics

Ideally, if we do a good job at capturing semantics during authoring, we can ignore the final presentation throughout the publishing workflow.  Layout and other presentation elements (margins, font family, color, size, etc.) can be applied to the archive version of the document for viewing (for example, during the generation of the HTML files).  Of course, we do not want authors dealing with raw XML editing.  Having a nice presentation within the word processor helps with authoring, but it does not need to be the final presentation.  Instead, it can be the presentation that works best for the author.

Currently some Word article templates rely on Styles to identify semantics.  This is the best that could be done with the capabilities of previous versions of Word, but it is fragile as it mixes presentation and semantics.  To make matters worse, it is easy for styles to "leak" through copy and paste, or editing, invalidating the semantics.  Word 2007's use of XML as its native format, its ability to have custom XML elements, and the extensibility of Word's user interface and file packaging, enable a more robust way of entering semantic elements during authoring, preserving metadata, and enabling conversion to other formats.

Capturing Semantics and Authors' Insights

Authors will be the largest class of users of the add-in, so we focus a lot on the experience that is presented to this audience.  Authors likely will have no idea of the format used to back their articles (whether OpenXML or NLM), nor should they care.  Also, the richness and complexity of the metadata expressible in the NLM format does not need to be exposed to them in a raw form (but needs to be accessible to the journal/archival staff - I will cover this in a future posting).

What is the benefit to authors from capturing semantics and metadata?  As semantic search evolves, articles with more/better semantic data should become more relevant in search results than articles without this information.  Thus, as content moves to be consumed primarily in digital form, articles with better semantics stand a better chance of being found, read, and cited.

Along the lines of the Dublin Core (or the core of the Core), we are focusing on enabling journals to capture a set of data from authors:

  • Sections (title, abstract, etc)
  • Author information
  • Keywords and subjects
  • Author notes
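
To make this baseline a bit more concrete, here is a rough sketch of how those four pieces of data could be laid out as NLM-style front matter.  The element names (article-meta, title-group, contrib-group, author-notes, kwd-group) come from the NLM Journal Publishing tag set, but the function name, the sample values, and the overall shape are my own assumptions for illustration.  This is not the add-in's code, and the output is simplified and not validated against the DTD.

```python
import xml.etree.ElementTree as ET


def build_front(title, authors, keywords, author_note):
    """Build a simplified NLM-style <front> element (hypothetical helper).

    authors is a list of (surname, given_names) tuples. The result is NOT
    validated against the NLM DTD; required elements are omitted for brevity.
    """
    front = ET.Element("front")
    meta = ET.SubElement(front, "article-meta")

    # Section-level data: the article title.
    title_group = ET.SubElement(meta, "title-group")
    ET.SubElement(title_group, "article-title").text = title

    # Author information: one <contrib> per author.
    contrib_group = ET.SubElement(meta, "contrib-group")
    for surname, given in authors:
        contrib = ET.SubElement(contrib_group, "contrib", {"contrib-type": "author"})
        name = ET.SubElement(contrib, "name")
        ET.SubElement(name, "surname").text = surname
        ET.SubElement(name, "given-names").text = given

    # Author notes, stored here as a single footnote paragraph.
    notes = ET.SubElement(meta, "author-notes")
    ET.SubElement(ET.SubElement(notes, "fn"), "p").text = author_note

    # Keywords.
    kwd_group = ET.SubElement(meta, "kwd-group")
    for keyword in keywords:
        ET.SubElement(kwd_group, "kwd").text = keyword

    return front


# Sample values are made up for illustration.
front = build_front(
    title="A Sample Article",
    authors=[("Doe", "Jane")],
    keywords=["structured content", "XML"],
    author_note="Correspondence: jane.doe@example.org",
)
print(ET.tostring(front, encoding="unicode"))
```

The point of the sketch is only that a small, fixed set of fields maps cleanly onto structured elements, which is what makes the data reliable for downstream tools.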

While it would be interesting to capture additional information within the content during authoring, it is important not to overburden authors.  At least if we manage to get this small set of data reliably, and reduce entry errors, then we will provide a good baseline for metadata in articles.  Over time, as authors become comfortable with the concept, and see benefits, the baseline can be moved up (but again, the user interaction needs to be simple and, as much as possible, unobtrusive).

Additional Reading

Peter Murray-Rust has a couple of threads with related topics, on semantics and chemistry and structured content and PDF.

Posted by pablofe | 0 Comments

The Increasing Relevance of XML in the Overall Content Lifecycle

I had a conversation with Jon Udell on the add-in, and the greater relevance of XML in the publishing lifecycle, which he just posted on the Channel 10 site.

Jon has a great approach to this new series of postings, where he provides different modalities for consuming the content.  Audiences can read a transcript, watch a screencast, or listen to the podcast.  Incidentally, over the last year, after I bought a Zune with its ability to automatically manage podcast subscriptions, I started to listen to podcasts while jogging and find it a very convenient way to catch up on podcasts (both technical and general topics, such as NPR programs) while exercising.  Providing multiple formats, and adapting to how people will consume content in the future, is a trend we need to keep an eye on for evolving the STM publishing space.

On its 10th anniversary, the infrastructure around XML (formats, tools, as well as processes/workflows) is now at a point where we will finally derive greater benefits from this great technology.  XML usage will now start to take place at the point of origin for content, without authors being directly aware of it, and its usage throughout the publishing workflow will be simplified.  The net result should be an improvement in time to publish, relevance in search, and in the presentation of the content (more on this in a future post).

Publishing Workflow – Math content as paths vs glyphs in generated PDF files

Recently I was involved in diagnosing an issue where, when a PDF file was generated from Word 2007, the Math content from the Word document was being converted to paths, instead of being represented by glyphs from the Cambria Math font.

Note that you can download a free add-in to generate PDF files from Word 2007 from here.  Also, Word 2007 has quite a bit of new Math functionality, and a beautiful font to go along with it (Cambria Math).

When the goal is to generate high quality content, whether Math content is represented as paths or glyphs makes a difference.  Note that this is not something that a casual observer would necessarily notice, as seen in these screen shots at 100% magnification.

The first screenshot is from the content in Word.  The second is of the generated PDF file with the content as paths.  The third image is of the generated PDF file with the content as glyphs.  There is very little difference in the three screen shots below (at least to me).

Original content in Word

Path based content in Adobe’s PDF viewer (100% zoom) 

Glyph based content in Adobe’s PDF viewer (100% zoom)

 

However, when zooming in at 600%, it is possible to start noticing that, in the case where paths were used, the curves have discrete line segments, whereas the glyph version continues to be smooth.

Path based content at 600% magnification; note the aliasing on the curved segments.

Glyph based content at 600% magnification, perfect!
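
As a back-of-the-envelope illustration of why the segments only show up when zooming in, here is a small sketch.  It assumes a curve is approximated by straight chords of a circle; the radius and segment count are made-up numbers, and this does not describe how Word or the PDF export actually flattens curves.  For a circle of radius r split into n equal chords, the largest gap between the chord and the true curve is r * (1 - cos(pi / n)), and that gap grows linearly with the zoom factor.

```python
import math


def chord_deviation(radius, segments):
    """Largest gap between a circle and an inscribed polygon with `segments` sides."""
    return radius * (1.0 - math.cos(math.pi / segments))


def on_screen_deviation(radius_pt, segments, zoom):
    """Same gap after magnification; zoom=1.0 means 100%, zoom=6.0 means 600%."""
    return chord_deviation(radius_pt, segments) * zoom


# Hypothetical numbers: a 5 pt radius curve flattened into 16 segments.
for zoom in (1.0, 6.0):
    gap = on_screen_deviation(5.0, 16, zoom)
    print(f"zoom {zoom:.0%}: deviation of about {gap:.3f} pt")
```

A gap that is a small fraction of a point at 100% becomes six times larger at 600%, which is consistent with the corners becoming visible only when zoomed in.  Glyphs avoid this because the viewer rasterizes the font outlines at whatever magnification is requested.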

Initially we could not reproduce the problem in our environment over here (let me tell you how much I hate it when we cannot reproduce bugs).  In talking to the folks that reported the problem, we verified that the fonts were correctly installed and the font file versions were as expected.  Folks in the Word team then tracked down under which conditions paths would get generated, and we also found out that the original problem was being seen on a Windows Server 2003 installation, not on a client configuration.  Note that this is not an issue that one would run into with Windows Vista, because it has a different default configuration.

From there it was straightforward to verify and solve the problem.  When exporting content, Word checks whether Complex Scripts or Far East scripts are enabled on the machine, to decide whether to generate paths or glyphs for Math content.  In case you run into a similar issue, the solution was to enable both scripts on the server (which may require the installation disk and a reboot), through the Languages tab in the Regional Settings control panel.

It is nice to have a happy ending to problems, and in this case being able to preserve high quality Math content.

Posted by pablofe | 3 Comments

Multi Discipline Relevance

Geoff makes the point that he thinks the NLM DTD is relevant for disciplines outside of Biology and Medicine, such as Humanities and Social Sciences.  I agree with his point and hope that with the add-in we will help authors and journals in those other disciplines, which already tend to use Word, in the submission and conversion processes.  Even going beyond the conversion to the NLM tagset, I think that there are other features in the add-in that will be of use to authors and to the editorial staff.  I plan to cover both of these aspects in future postings.

Covering yet another set of scientific disciplines, last month the folks from ArXiv (the largest repository for Physics, also very popular for Math and Computer Science papers) posted the news that they now accept OpenXML files for submissions.  I am also looking forward to being able to help scientists and researchers that deposit OpenXML files into ArXiv.

One fun aspect of this project is being able to work with the very sharp individuals who are at the center of the ongoing evolution in the Scientific, Technical, and Medical publishing space.  They range from folks in the repository space (from the National Library of Medicine, British Library, and ArXiv), to the staff of commercial publishers and journals, to key folks in many of the companies that develop tools for the publishing workflow.

Posted by pablofe | 1 Comments

Short Video Clip of the Add-in

If you want to get an idea of how the add-in works, without having to download and install it, here is a video.

 

 You can skip to 4:40 to see the demo part; before that I give an introduction (a subset of the information in the earlier postings).

 We welcome your feedback and comments.  Let me know if there are specific topics you would like to see covered in future videos or postings.

I would like to take this opportunity to also highlight the collaboration with the British Library.  Earlier this year the British Library, as part of a group of institutions, launched a PubMed Central for the UK - http://www.bl.uk/news/2007/pressrelease20070105.html

"This useful Add-In will enable authors to structure their Word documents according to defined journal article templates quickly and easily and to export them and their associated metadata in a recognised format conducive to both digital storage and preservation."

Stephens Andrews, Products and Services Development Leader for e-Strategy and Information Systems at the British Library

 

Posted by pablofe | 4 Comments

The Use of XML Formats in Repositories

One interesting aspect of the XML format from the National Library of Medicine is its use in PubMed Central, which is the largest archive for biomedical articles (http://www.pubmedcentral.nih.gov/).  After articles are published in different journals (print and online), they are then submitted for archival at PubMed Central.  As part of the archival process, the articles are converted to the NLM XML format.  The NLM format is light on presentation elements (while there are elements like bold and italic, which influence the presentation of text, there is no control over background color, or border styles and colors for tables, for example), but the format encompasses a lot of metadata (more on that in a future post).

The main focus of the archival format is on the content itself, rather than on how the content is to be presented. When someone reads an article on PubMed Central, for example http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2253543&rendertype=abstract, the article is presented in HTML or PDF, generated from the XML content.  As part of this conversion, presentation related attributes are applied to the content that was expressed in XML.  Storing the content in XML provides a lot of versatility in terms of presenting content, while preserving the relevant information, and I think is likely to be a more common process in the future, across journals and other repositories, as more journals become electronic only and the content is consumed through devices with different form factors.
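
To illustrate the idea of applying presentation at rendering time, here is a minimal sketch.  The XML fragment uses a few element names from the NLM tag set (article-title, sec, title, p), but the sample content, the stylesheet, and the render_html function are assumptions of mine; this is not how PubMed Central generates its pages, just a small example of the same separation.

```python
import xml.etree.ElementTree as ET
from html import escape

# A tiny, simplified NLM-style article (sample content is made up).
SAMPLE = """<article>
  <front><article-meta>
    <title-group><article-title>Sample Title</article-title></title-group>
  </article-meta></front>
  <body>
    <sec><title>Introduction</title><p>First paragraph.</p></sec>
  </body>
</article>"""

# Presentation lives outside the archived content and can change freely.
STYLE = "body { font-family: Georgia, serif; margin: 2em; }"


def render_html(xml_text, css=STYLE):
    """Render the simplified article XML to HTML, applying presentation here."""
    root = ET.fromstring(xml_text)
    title = root.findtext("front/article-meta/title-group/article-title", default="")
    parts = [
        f"<html><head><style>{css}</style></head><body>",
        f"<h1>{escape(title)}</h1>",
    ]
    for sec in root.iterfind("body/sec"):
        parts.append(f"<h2>{escape(sec.findtext('title', default=''))}</h2>")
        for para in sec.iterfind("p"):
            # itertext() flattens inline markup such as <bold> or <italic>.
            parts.append(f"<p>{escape(''.join(para.itertext()))}</p>")
    parts.append("</body></html>")
    return "\n".join(parts)


print(render_html(SAMPLE))
```

The same XML could be rendered with a different stylesheet for a phone, a reader device, or print, without touching the archived content.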

The collaboration with the staff at the National Center for Biotechnology Information has been critical in the development of the add-in, as well as to ensure that the output from the final version of the add-in will meet the needs of journals and conform to the requirements for archival.  Many thanks to them for their input and for their support.

“We are delighted that Microsoft has chosen to support the NLM DTD, building on Office 2007, and has produced this technology preview as a proof of concept. Having a tool that will automatically transfer an author’s work from Microsoft Word into the NLM DTD will benefit authors as well as publishers. We are eager to see Microsoft make the release version of this tool available to the community.”

- David Lipman, M.D., Director of the National Library of Medicine’s National Center for Biotechnology Information

 

Posted by pablofe | 0 Comments

Welcome to the Article Authoring Add-in

Today we are making available a Technology Preview of the Article Authoring add-in for Word 2007, focused on the community of authors, editors, and publishers of scientific and technical articles.  The goal is to simplify several activities in the publishing workflow, from authoring to publishing and archiving, with this last step including conversion to the XML format from the National Library of Medicine.  The current process of getting an article from the authors to a journal (increasingly electronic only) is a bit complicated and many times lossy, especially in relation to the metadata related to the article.  We hope that the add-in will help simplify and improve the process.

At the core of many publishing workflows is the XML format from the National Library of Medicine (the format is also used for long term archiving and preservation of articles – and actually there are four formats (DTDs) defined by the National Library of Medicine).  Beyond the ability to save and open files in the NLM format from Word 2007, the add-in also enables editing of the metadata, which is an important part of the format, directly from within the Word user interface.

Additionally, the add-in provides the editorial staff at the different journals with the ability to define templates, used to assist authors in the writing process, so that, as an end result, articles submitted to journals more closely match the journal requirements in relation to the different sections allowed for the article, their length, and some of the metadata required for publishing.

Editing metadata is also an important aspect of the publishing workflow.  The add-in enables authors and journal staff to access and edit metadata (supplementary information that complements the content of the article, and which is very useful for search) within the Word user interface.

Content and Metadata

Beyond the core content of the article provided by the authors, there is useful information that is attached to an article, which is important for search.  The authors are best suited to enter some of this information, such as the author information (contact information, affiliation, biography), as well as some of the basic information about the content, such as keywords and the taxonomy of the article (subjects being covered).  In addition, there is metadata that the editorial staff needs to provide, such as the license for the article, publishing date, and information about the journal where the article will be published.

This type of information is not accessible through the default user interface in Word, but the add-in extends the UI to enable editing of the metadata.   The metadata entered is then kept with the content as part of the docx file, which should provide greater flexibility to the editorial and publishing staff, as they will be able to, for example, send files back to the authors to review last minute changes or to take updates.

OpenXML as the Enabling Technology

Starting from the OpenXML content simplifies the conversion process from one XML format to another, but beyond this, there are a couple of OpenXML features that make the overall solution come together.  In future postings I will cover these in more detail, but custom schemas and the ability to store additional information in the file (through the Open Packaging Conventions) are key in being able to package the content and metadata in a single file, which can then be opened and edited by any tools as part of the publishing process.
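
As a small taste of what "opened and edited by any tools" can look like, here is a sketch that inspects a package without using Word.  A docx file is a ZIP-based package, so the Python standard library is enough to list its parts.  The customXml/ folder is where Word conventionally keeps custom XML data parts, but whether the add-in stores its metadata there, and the part names and content used in the demo package below, are assumptions for illustration only.

```python
import io
import zipfile


def list_custom_xml_parts(package):
    """Return the names of parts under customXml/ in a package (path or file-like)."""
    with zipfile.ZipFile(package) as zf:
        return sorted(name for name in zf.namelist() if name.startswith("customXml/"))


def read_part(package, name):
    """Return the raw bytes of one part in the package."""
    with zipfile.ZipFile(package) as zf:
        return zf.read(name)


def make_demo_package():
    """Build a tiny in-memory stand-in for a docx (NOT a valid Word document)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("[Content_Types].xml", "<Types/>")
        zf.writestr("word/document.xml", "<document/>")
        # Hypothetical metadata part; real content and naming may differ.
        zf.writestr("customXml/item1.xml", "<article-meta/>")
    return buffer


demo = make_demo_package()
for part in list_custom_xml_parts(demo):
    print(part, read_part(demo, part))
```

With a real file you would pass its path instead of the demo buffer, for example list_custom_xml_parts("article.docx"), where article.docx is a placeholder name.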

The Technology Preview Build

We are providing this preview build to gather feedback and requirements.  The target audience is the editorial and publishing staff at STM journals, companies that develop publishing tools, and technical staff at Information Repositories, libraries, and archives.  Researchers and scientists that are early adopters can also download the preview to provide feedback on the user experience.  Do not use this build for production purposes, only for evaluation.

We welcome your feedback and, if you run into any issues, please let us know about them as well.  We will be posting updates and answering your questions in this blog.  Also, let us know if there are specific topics you would like us to cover in follow-up postings.

We hope that you will find this add-in useful and that it will help simplify your workflow and authoring experience.

The NLM tagsets

The add-in provides support for the Journal Publishing DTD - http://dtd.nlm.nih.gov/publishing/tag-library/2.2/.  For more information on the different formats, check out this overview http://dtd.nlm.nih.gov/.

Posted by pablofe | 3 Comments
 