
Workshop at the Open Repositories ‘09 conference

There is limited space available, but if you are going to be at the Open Repositories '09 conference, you may want to register for and attend the workshop we are presenting on Thursday, May 21st, from 1 to 6 PM.

The workshop will be a one-stop shop encompassing all the tools and services that are being sponsored by Microsoft Research in the Scholarly Communication space.  The centerpiece of the workshop is going to be Zentity, the research-output repository platform currently in Beta 2.  Also covered will be the Article Authoring and Ontology add-ins, on the authoring front, as well as the Electronic Journals Service and the Research Information Center projects.

The workshop will also highlight two services that are useful for the consumption and archiving of articles: the Machine Translation and Document Conversion services.

More details on this and other workshops are available at https://or09.library.gatech.edu/workshops.php.

Posted by pablofe | 1 Comments

Upcoming Events and Updates

There are two conferences coming up in May, where we will be presenting updates on the Article Authoring Add-in and other projects:

These are great opportunities to discuss authoring, metadata, and semantics, as well as the evolution of authoring in relation to the publishing workflow.

In addition to the evolution of core functionality related to semantics and metadata, and improvements in the user experience, there are a number of new features in the Authoring Add-in aimed at better integration with submission and archival workflows.  One of these new integration features relates to submissions from Word using the SWORD protocol, which is based on the Atom Publishing Protocol.  The other relates to mapping relationships between content in the document and data, through the use of ORE resource maps.

I will be presenting at both events, so feel free to come by if you want to discuss these topics or introduce yourself.  Also feel free to contact me ahead of the events if you want to arrange meetings.

Posted by pablofe | 1 Comments

Ontology Add-in for Word 2007 - Technology Preview

We just posted a Technology Preview build of an add-in that enables adding semantic knowledge to documents by associating words in the document to ontology terms.  This add-in can be installed from CodePlex, and the source is available here under the Ms-PL license.

Add-in Basics

The add-in, developed in collaboration with the University of California San Diego and Science Commons, serves as a solution accelerator for those working in the ontology field.  The add-in works in two ways:

- background scanning

- direct tagging

By default the add-in will scan terms in the document and present suggestions for terms it recognizes using SmartTags.  Through the SmartTag menu, authors can associate recognized words to the appropriate ontology terms.

Figures: SmartTag highlighting and menu; Ontology SmartTag menu

Authors can also tag words directly, by highlighting the words and selecting the ontology term that they want to associate with them.

There are a large number of ontologies available online, using the OBO format, and through the configuration dialog, additional ontologies can be downloaded to the computer for use by the add-in.

Target Audience

Looking at the developer stack from higher to lower levels of abstraction, the add-in will be useful in at least three key areas:

- Development of new ontologies

- Investigation of new author interaction paradigms

- Integration into publishing and semantic workflows

 

For those developing new ontologies, the add-in provides a very easy way to test those ontologies with their target audience.  In many scientific disciplines, Microsoft Word is a very popular tool for authoring papers and articles, and as such, authors are already familiar with its usage and features.  The add-in builds on this familiarity to expose new functionality, and additional ontologies can be downloaded through a REST interface.

 

For researchers who are focused on new ways of analyzing content and detecting terms automatically, or on extending the user interaction with authors, access to the source code provides a great foundation to build on, without having to start from scratch.  Also, community members can add incremental value to the add-in and share it back with others in the community.  For this purpose, CodePlex provides a good forum to host discussions, report bugs, and publish documents related to the project.

 

Developers that work in the publishing industry, or at libraries and repositories, can customize or extend the add-in to present a user interface specific to their organization, or to add information to the XML content when terms are recognized and tagged by the add-in.  Enhancing the information in the XML tag that captures the semantic information would also be useful to those doing semantic analysis, search, or storing information in databases.

 

Tagging

The add-in relies on custom XML tags to associate the semantic information with the matched words.  The semantic information is stored as part of the document content.  Utilities and applications that read or process docx files can retrieve and use this information (or transform it to other formats).

<w:customXml w:uri="http://biolit.ucsd.edu/biolitschema" w:element="biolit-term">
  <w:customXmlPr>
    <w:attr w:name="id" w:val="GO:0031386" />
    <w:attr w:name="type" w:val="Biological process" />
    <w:attr w:name="status" w:val="true" />
    <w:attr w:name="OntName" w:val="Biological process" />
    <w:attr w:name="url" w:val="http://purl.org/obo/owl/GO#GO_0031386" />
  </w:customXmlPr>
  <w:smartTag w:uri="BioLitTags" w:element="tag1">
    <w:r>
      <w:t>protein tag</w:t>
    </w:r>
  </w:smartTag>
</w:customXml>
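
As a rough illustration of how a utility might retrieve this information, the sketch below uses Python's standard library to pull tagged terms out of WordprocessingML. It assumes the markup lives in word/document.xml inside the docx package and that the w prefix is bound to the standard WordprocessingML namespace. The function name extract_terms and the SAMPLE wrapper document are made up for this example and are not part of the add-in.

```python
import xml.etree.ElementTree as ET

# Standard WordprocessingML namespace (the "w" prefix in the markup above).
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def extract_terms(document_xml):
    """Return one dict per biolit-term element: its attributes plus the tagged text."""
    root = ET.fromstring(document_xml)
    terms = []
    for node in root.iter(f"{{{W}}}customXml"):
        if node.get(f"{{{W}}}element") != "biolit-term":
            continue
        term = {}
        props = node.find(f"{{{W}}}customXmlPr")
        if props is not None:
            for attr in props.findall(f"{{{W}}}attr"):
                term[attr.get(f"{{{W}}}name")] = attr.get(f"{{{W}}}val")
        # Concatenate the text runs wrapped by the tag.
        term["text"] = "".join(t.text or "" for t in node.iter(f"{{{W}}}t"))
        terms.append(term)
    return terms


# Hypothetical minimal document wrapping a fragment like the one shown above.
SAMPLE = f"""<w:document xmlns:w="{W}"><w:body><w:p>
<w:customXml w:uri="http://biolit.ucsd.edu/biolitschema" w:element="biolit-term">
  <w:customXmlPr>
    <w:attr w:name="id" w:val="GO:0031386" />
    <w:attr w:name="OntName" w:val="Biological process" />
  </w:customXmlPr>
  <w:smartTag w:uri="BioLitTags" w:element="tag1">
    <w:r><w:t>protein tag</w:t></w:r>
  </w:smartTag>
</w:customXml>
</w:p></w:body></w:document>"""

if __name__ == "__main__":
    for term in extract_terms(SAMPLE):
        print(term["id"], "->", term["text"])
```

For a real file, the XML string could be obtained with zipfile.ZipFile(path).read("word/document.xml") before passing it to the function.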

Additional Functionality

The add-in also enables authors to search for terms in ontologies, look up their definitions, and browse the ontologies to understand their organization and structure and to examine their terms.  The add-in also provides a way to highlight tagged terms, which makes it simple to review the document and identify all tagged words.

Ontology browser

A final useful piece of functionality is the recognition of protein ID patterns from the National Center for Biotechnology Information (NCBI) and Protein Data Bank (PDB) databanks.

Protein ID recognition
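
To give a feel for what this kind of pattern recognition involves, here is a small Python sketch using regular expressions. The patterns are assumptions made for illustration, not the ones the add-in uses: a PDB ID is taken to be a digit 1-9 followed by three alphanumerics, and the NCBI pattern only covers RefSeq-style protein accessions such as NP_000537.3.

```python
import re

# Assumption: a PDB ID is four characters, a digit 1-9 followed by three
# alphanumerics. Requiring at least one letter skips plain years such as 2008.
PDB_ID = re.compile(r"\b[1-9](?=\d*[A-Za-z])[A-Za-z0-9]{3}\b")

# Assumption: RefSeq-style protein accessions (prefix ending in "P_", digits,
# optional ".version"). Other NCBI identifier formats are not covered.
REFSEQ_PROTEIN = re.compile(r"\b[ANXYW]P_\d{6,9}(?:\.\d+)?\b")


def find_protein_ids(text):
    """Return candidate PDB and NCBI protein identifiers found in the text."""
    return {
        "pdb": PDB_ID.findall(text),
        "ncbi": REFSEQ_PROTEIN.findall(text),
    }


if __name__ == "__main__":
    sample = "The structure 1TUP and the sequence NP_000537.3 were used in 2008."
    print(find_protein_ids(sample))
```

A purely pattern-based approach like this will produce false positives (an ordinal such as "10th" fits the PDB shape), which is one reason to present matches to the author as suggestions rather than tagging them automatically.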

 

Upcoming Scholarly Publishing Events in London - See You There

A few of the Microsoft people involved in Scholarly Publishing (Lee Dirks, Alex Wade, and myself) will be in London next week (December 2nd through 5th) to attend the Online Information 2008 conference, as well as present at two events from the International Association of Scientific, Technical & Medical Publishers - the STM E-Production Seminar 2008 and the STM Innovations Seminar 2008 - New Streams, New Views, New Directions, Moving to a Future Beyond Text (more information here).

This is a great opportunity for us to connect with others involved in the scholarly publishing space in Europe, so, if you are attending one of these events or are in the London area, we invite you to contact us and see if we can arrange informal meetings.  Drop us a note in the comments, or feel free to say hi to us at the conferences.

Posted by pablofe | 1 Comments

Peer Review in the Age of Software as a Service

The Microsoft eJournal Service is a good example of a growing trend towards delivering functionality through the Software as a Service approach.

This web based service enables scientists and researchers to more easily engage in the collaborative process that is the foundation of Scholarly Publishing. The service aims at lowering the technical and financial barriers involved in getting a publication up and running, by removing the need for purchasing and maintaining servers, as well as installing and updating software packages. While some aspects of publishing remain the same, such as producing good research, capturing the results in an article, finding experts to review the article, and polishing the article for publishing, the goal is that the service will make the publishing process more accessible and available to a larger number of scientists.

Roles and Workflows

The service is based on three key roles: Editors, Reviewers, and Authors. Through the service, Editors can gather and review submissions from Authors, and coordinate the review process with the Reviewers. At the end of the review process, approved articles can be posted to the journal site and/or submitted to repositories, or even passed to other services.

Underlying the service is a set of workflows, which guide the different participants through the process and help manage the tasks and deadlines. These workflows support the core interactions which underlie the review process, with some options available to configure the workflow.

Format Independence and Browser Neutrality

The service does not impose the use of a particular file format, although the Editor can restrict submissions to certain file types if desired, and it should be accessible through any web browser. In selecting a file format, I would advise migrating to XML-based formats, such as OpenXML, which can more easily capture semantics, metadata, and relationships between content and data, and are more conducive to computer processing for search and semantic analysis.

Repositories

At the end of the process, the Editor can configure the service to deposit the articles to different repositories. One of those repositories is ArXiv, which is very popular for Physics and Math content, and can now be accessed using the SWORD protocol.

The service can also be used to deposit to other SWORD based archives. This functionality would also be useful for depositing into institutional repositories, and as such, the service could be used to manage the review process for publications such as theses.

In order to deposit to a repository, you will need a login name and password on the system. The repository may have requirements as to the file formats supported, and their packaging, which you will need to match before submitting.
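
For readers curious about what a SWORD deposit looks like on the wire, here is a minimal Python sketch. Since SWORD builds on the Atom Publishing Protocol, a deposit is essentially an authenticated HTTP POST of a package to a collection URL. The sketch assumes a SWORD 1.x style repository that accepts HTTP Basic authentication and the X-Packaging header; the function name, URL, credentials, and packaging identifier are placeholders, and this is not how the eJournal Service itself is implemented.

```python
import base64
import urllib.request


def build_sword_deposit(collection_url, package, filename, username, password,
                        packaging, content_type="application/zip"):
    """Build (but do not send) an HTTP POST that deposits a package to a collection."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    headers = {
        "Authorization": f"Basic {token}",       # login name and password
        "Content-Type": content_type,            # file format of the package
        "Content-Disposition": f"filename={filename}",
        "X-Packaging": packaging,                # packaging the repository expects
    }
    return urllib.request.Request(collection_url, data=package,
                                  headers=headers, method="POST")


if __name__ == "__main__":
    request = build_sword_deposit(
        "https://repository.example.org/sword/collection",  # hypothetical URL
        b"zip bytes would go here",
        "article.zip",
        "editor",
        "secret",
        packaging="http://purl.org/net/sword-types/METSDSpaceSIP",  # example value
    )
    print(request.get_method(), request.full_url)
    # urllib.request.urlopen(request) would actually perform the deposit.
```

The packaging and content type are passed in explicitly because, as noted above, each repository sets its own requirements for formats and packaging.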

Those in the biomedical field can also choose to deposit into PubMed Central. As noted above, you need to be approved for deposit ahead of time and have access to the system.

Participate!

As with the Authoring Add-in, we welcome your participation and feedback. We would love to hear from you in relation to what we can offer to make you more productive and, hopefully, make the technology disappear in the background, freeing you to focus on the task at hand and simplifying the process. Give the service a try at http://journal.mssandbox.net/.

Other Interesting Services

Two very interesting recently introduced services are Office Live Workspaces and Office Live Small Business. Office Live Small Business is a great example of making online presence and collaboration more accessible through a subscription model, along the lines of what we are trying to achieve with the eJournal Service. For those that are interested in the technical details, underlying Office Live Small Business is Microsoft SharePoint.

Squarely on the business front are two new software-as-a-service offerings, hosted SharePoint and hosted Exchange. Besides being useful to small, medium, and even large businesses, both of these services should be useful to universities, colleges, and research institutions.

Release Candidate for Article Authoring Add-in Now Available for Download

We are happy to announce that this morning we posted the Release Candidate build of the Article Authoring Add-in.

Over the past couple of months, the community has provided very useful feedback based on the Beta 1 release.  We feel that we have refined the overall experience and addressed the key elements of the feedback we have received in this Release Candidate of the add-in.  Thank you for your engagement and support.

Starting with a simplified install experience, this latest release has a number of improvements under the covers, from enhancing the XML that is generated to improvements in the user interaction, especially for the Journal Panel.  We encourage you to download this new build and evaluate it as part of your workflow.  As you think of using the different functionality provided by the add-in, please send us your comments and requests for future releases.

Let's do a quick recap of what the add-in provides:

  • Open/Save files into the National Library of Medicine XML format

XML documents in the NLM format can be opened from within Word, edited, and saved, both as Word files and back again as XML.  The add-in also includes support for the NLM book format.

  • Access to Metadata from within the Word user interface

Author, article, and journal metadata is accessible through the user interface exposed by the add-in, enabling the editing of all information that is part of the NLM format.  Software developers can also write tools and applications to create or access this data programmatically, for example connecting the data in a document to a database.

  • Incorporating NLM semantic elements within the Word document

Starting with Sections, semantic elements appear explicitly within the document, and enable authoring in a more structured manner, better preparing the document contents for analysis, validation, and search.

  • Ability to create and use templates

The add-in installs a set of example templates: a blank article template, a blank book chapter template, and a sample article template with keywords and sections.  The blank articles are particularly useful for starting new articles, or for providing structure to content pasted in from another document.
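
As an example of the programmatic access mentioned above, here is a short Python sketch that reads a few pieces of metadata from an article saved in the NLM XML format. The element names (article-title, journal-title, contrib, surname, given-names, kwd) come from the NLM Journal Publishing tag set; the function name and the sample document are invented for illustration, and a real article will carry far more metadata than this.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed article in the NLM Journal Publishing format.
SAMPLE = """<article>
  <front>
    <journal-meta><journal-title>Example Journal</journal-title></journal-meta>
    <article-meta>
      <title-group><article-title>An Example Article</article-title></title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <name><surname>Doe</surname><given-names>Jane</given-names></name>
        </contrib>
      </contrib-group>
      <kwd-group><kwd>ontology</kwd><kwd>semantics</kwd></kwd-group>
    </article-meta>
  </front>
  <body><sec><title>Introduction</title><p>Text.</p></sec></body>
</article>"""


def read_metadata(nlm_xml):
    """Collect the journal title, article title, authors, and keywords."""
    root = ET.fromstring(nlm_xml)
    authors = []
    for contrib in root.iter("contrib"):
        if contrib.get("contrib-type") != "author":
            continue
        given = contrib.findtext("name/given-names", default="")
        surname = contrib.findtext("name/surname", default="")
        authors.append(f"{given} {surname}".strip())
    return {
        "journal": root.findtext(".//journal-title"),
        "title": root.findtext(".//article-title"),
        "authors": authors,
        "keywords": [kwd.text for kwd in root.iter("kwd")],
    }


if __name__ == "__main__":
    print(read_metadata(SAMPLE))
```

The resulting dictionary could then be written to a database row or compared against records already held by a journal or repository.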

We feel that the add-in supports the evolution to the greater use of XML as the underlying format for archiving articles.  Especially as part of the transition to electronic-first or electronic-only publishing, the add-in should prove useful in generating XML content, without having first to take articles through the traditional print-oriented, page-layout-based processes.  The resulting XML content can then be transformed for presentation, making use of the semantic information in the document to determine presentation parameters.

In addition, the add-in should be particularly useful to journals/publishers in the biomedical fields, where many articles are now required to be submitted to PubMed Central for archival.

Introducing the Alpha Preview of the Microsoft eJournal Service

Update

We would like to highlight the collaboration and input from commercial and non-commercial entities in relation to the Microsoft eJournal Service and the Article Authoring add-in projects:

"I am pleased that Microsoft is taking innovative steps to support more open, efficient, and effective scholarly communication in the digital networked environment. For example, the free eJournal  Service gives many scholarly societies a valuable new option for online publication and a way to avoid taking on high costs. The Article Authoring and Creative Commons add-ins to Word also are good news, offering capacities that could bring down production costs and allow authors to better manage their intellectual property rights." – Heather Joseph, Executive Director, SPARC (Scholarly Publishing & Academic Resources Coalition)

 

"Partnering with members of the scholarly community, Microsoft External Research is working to facilitate the next step in the transformation of scholarly communications with networking tools built into Microsoft products. The Article Authoring add-in for Microsoft Word 2007 permits authors to produce documents directly in the format used by the NLM's PubMed Central repository, and is a significant step towards producing next-generation documents semantically tied to distributed network databases and relevant ontologies.  The Microsoft team has also worked with the arXiv.org database on an automated upload protocol for documents and metadata, both for ingest from individuals and of entire conferences.  We look forward to further enhancements, permitting autonomous discovery of related documents, relevant materials, and other linkages, accelerating the move towards a better integrated scholarly knowledge network." - Paul Ginsparg, professor of Physics and Information Science at Cornell University

 

 

“NCBI welcomes Microsoft’s decision to support NLM format XML in the Article Authoring add-in for Microsoft Word,” said James Ostell, Ph.D., Chief of the Information Engineering Branch at the National Center for Biotechnology Information (NCBI), a division of the National Library of Medicine.  “NLM’s archival format for electronic documents has been adopted by the Library of Congress and the British Library, and directly supporting this standard in Word is an important step toward simplifying the process to archive the scientific literature. It also opens doors to new possibilities to integrate data and tools with the traditional scientific authoring process.”

 

"The Add-In will enable scholars and scholarly publishers to use the familiar Word environment for writing, editing, and tagging scholarly articles in the industry standard NLM XML DTD. With about two million articles authored and published every year, the potential impact on this Add-In should not be underestimated."    Ahmed Hindawi, CEO of Hindawi Publishing Corporation

We would also like to thank all other publishers, companies, institutions, and individuals who helped shape the releases we introduced today.

---

This morning we are excited to be introducing the Alpha Preview of the Microsoft eJournal Service.

The service is focused primarily on electronic-only journals.  Scientists and researchers wanting to start new journals, or those with an existing small journal, should find the system useful for conducting the peer review process, and for archiving published articles to Information Repositories (IRs) and/or the public facing section of their site.  It is common for electronic-first journals to publish articles as they are approved, so the service presents an article-focused process.  The philosophy of the project is to keep the workflows simple and avoid overcomplicating the management process and site usage with too many options.

The goal of the Alpha Preview is to gather feedback and requirements from the community, to help us pinpoint what additional functionality and changes would be useful to you, as we enhance the service going forward.  As part of the Alpha Trial program, sites will be active only for a limited time and are restricted in the number of people that can participate and in the number of articles that can be processed.  As the service evolves, the functionality will be expanded, and more open, Beta level, programs will be offered.

The service is open to all file formats: article submissions can be of any format, as configured in the site settings.  On the archival side, the service supports depositing into any Information Repository that uses the SWORD protocol (for example, the ArXiv repository and EPrints based IRs).

The service nature of this offering means that users don't have to be concerned with procuring and maintaining hardware, or with software installation and updates.  The service is usable from any web browser, without requiring any local applications.  The service relies on tasks and emails to keep participants informed of the work items assigned to them and associated deadlines.

It is our hope that, as the service evolves, it will help facilitate greater online collaboration for the scholarly communication community, and lead to greater dissemination of knowledge.

 We welcome your comments and feedback on the service!

Feedback on your usage model

Participation in the Beta program so far has been fantastic and we have been able to incorporate most of the feedback and requests that you have sent in so far.  It is evident that this community is very hands-on and enthusiastic about the authoring and archiving process.

We have been able to engage in a good dialog with several community members, with folks submitting their sample documents for testing and scenarios as input.  While a lot of people have downloaded the add-in, not everyone has contacted us with comments and feedback yet, so I wanted to open up the dialog and solicit your input. 

As expected, the majority of the early adopters are part of the staff at journals, repositories, libraries, and also companies that support the publishing workflow.  There are a few enthusiasts/early adopters, as well as folks interested in the writing process and in capturing semantics, who have also tried out the add-in and sent their feedback.  Many thanks to all who have contacted us!

To help kickstart the broader dialog, here are some questions as to how you are using, or planning to use, the add-in as part of your workflow:

  • Do you start using the add-in by importing an xml file, or by pasting in content from another document?
  • Do you use the add-in for editing content, metadata, or both?
  • Do you introduce new sections into documents?  Do you tend to use custom sections?
  • Have you created templates, with sections that you use often, for re-use?
  • Have you used the keyword or subject area panel (on the right side of the window)?
  • Are there elements in the NLM format which are not currently supported in the add-in which are essential for your workflow?
  • Do you plan to access the content or metadata in the file directly, without using Word (for example using your own internal tools)?

If there is anything else you want to comment on in relation to the add-in, feel free to send it in.  You can post your answers as comments on the blog or send them over email.  This is a great opportunity to engage in the development of the add-in and help shape the experience for authoring scientific and technical articles.

And, once more, many thanks for your participation and input.

Posted by pablofe | 1 Comments

Beta of Microsoft’s Article Authoring Add-in Now Available Broadly for Download

Enabling Journals to Better Connect with Scientific Authors in a Digital World

 

 

REDMOND, Wash. — May 30, 2008 — At the annual meeting of the Society for Scholarly Publishing in Boston, Microsoft announced the wide availability of the Beta 1 release of the Article Authoring Add-in for Microsoft Word 2007. In addition to enabling Word users to open and save documents using the National Library of Medicine’s XML Journal Publishing format, used for the authoring of scientific articles, the Beta 1 release adds support for the NCBI Book format, used for authoring book chapters for digital books.

 

Enabling Journals to Better Connect with Authors in a Digital World

A key value of the Article Authoring add-in is in enabling editors at scientific and technical journals to create article templates, tailored for their individual journals’ requirements.  These templates will assist authors in writing articles with greater consistency in relation to the structure of the articles, better reflecting the content requirements of the journals, and in expressing semantic information which is key for the search and consumption of articles in digital form.

 

“The Add-In is a very positive development that will help scholars to write and tag their articles in the industry-standard NLM XML DTD, and will help publishers to process these articles in their editorial and production departments. We are pleased to be working with Microsoft on testing and refining this important tool that will benefit scholars and scholarly publishers alike”, said Ahmed Hindawi, CEO of Hindawi Publishing Corporation.

 

Preserving Information for Archiving and Search

The Article Authoring add-in enables authors to express a greater variety of semantic information, and metadata, as part of writing articles.  This semantic information, captured in the XML format and preserved based on the extensibility in the Open XML standard, will prove valuable in improving the results from search queries and for the long term archival of scientific information.

 

In addition to preserving information that is native to Microsoft Word, the Beta 1 release of the Article Authoring add-in also preserves Math information from controls, such as Design Science’s MathType, when saving Word documents to the NLM XML format.  Paul Topping, President and CEO of Design Science, Inc., stated that "We were happy to work with Microsoft to add support for Equation Editor and MathType equations to the Article Authoring add-in. Since at least 85% of the articles containing math submitted to scientific journals have equations in those formats, this support is critical."

 

The Open XML standard, with its capabilities to support custom-schemas, enables the Word add-in to support the entire set of rich information encoded by the NLM format. The add-in also provides easy access to the metadata in the NLM format, both by journal editors and by authors, directly from within the Word user interface.  The broad availability of the Beta 1 release provides a way for the different communities, such as authors, journals, digital archives, and software vendors, to evaluate the technology and provide feedback, guiding further development of the add-in towards its initial release in the second half of 2008.

 

Information on how to download the Beta 1 release of the Article Authoring Add-in for Microsoft Office Word 2007 can be found at http://www.microsoft.com/downloads/details.aspx?FamilyID=09c55527-0759-4d6d-ae02-51e90131997e&displaylang=en.

Posted by pablofe | 1 Comments

See You at SSP Next Week?

I, along with folks from the Word team and Technical Computing Initiative/Microsoft Research, will be in Boston next week, participating in a session on Wednesday morning (Seminar 1).

Also on Friday, I will be hosting an informal discussion table during the Networking Lunch.

If you are interested in a demo or having discussions during the event, drop a comment here, or for a more general get together, join this Facebook event.

Hope to see you there.

Posted by pablofe | 0 Comments

May Tidbits

A few interesting news items:

  •  Aries Systems now supports Word 2007 docx files
  •  The Times Reader will start a Beta of their Mac version

Last, I wanted to highlight that we will be attending (and presenting) at the upcoming Society for Scholarly Publishing meeting in Boston in a couple of weeks.  If you want to meet up, join this event on Facebook, or drop a comment on this blog.

Posted by pablofe | 0 Comments

The Power of Structured Content

As we get comments and questions from you about the add-in, or have opportunities to discuss the experience and underlying technology face-to-face (as at the HighWire Publisher's Meeting last week), it is interesting how frequently the topic of structured content comes up.

Looking Back and Looking Forward

Although legacy content (book scanning) receives more awareness, visibility, and coverage, it is just as important, and perhaps even more so, to focus on how new content is being created.  As the consumption of content makes the transition from print to digital, and search becomes one of the key ways we come across new content (journals and conferences being the other traditional ways), it is important that we evolve the process by which new content is created, in order to fully exploit the benefits of the new digital medium.

The way articles are commonly authored today is still largely focused on print as the end point, in that there is still a larger focus on presentation over semantics.  The semantics in many workflows are added after the article is approved, somewhere along the publishing pipeline, by people other than the article authors. We need to enable authors to add semantics as part of the authoring process, and have that content preserved through the publishing workflow.  The best technology available to preserve semantic content today is to express the article in XML.  Archiving articles as plain text (either HTML or PDF) results in a loss of information.  This loss of information is not only detrimental to effective search today, but will also be detrimental to other types of semantic analysis in the future.

Lossy Workflows

In some print-focused workflows, even if semantic elements were present in the original file created by the author, the semantics are lost in the digital version, as a result of the process leading to print.  At the end of this type of workflow, a PDF file is generated, reflecting the print layout.  Any semantic information in the original article is lost in the resulting PDF.  The PDF file, in turn, is used to generate an XML file, which is the basis for the digital content (and usually has to be sent out for tagging).  These types of workflows not only result in a loss of semantic information, but are inefficient from the point of view of creating digital content.  As journals move to be digital first (or digital only), editors will need to pay close attention to how their workflows are structured and how data is preserved/converted.

Presentation vs Semantics

Ideally, if we do a good job at capturing semantics during authoring, we can ignore the final presentation throughout the publishing workflow.  Layout and other presentation elements (margins, font family, color, size, etc) can be applied to the archive version of the document for viewing (for example, during the generation of the HTML files).  Of course, we do not want authors dealing with raw XML editing.  Having a nice presentation within the word processor helps with authoring, but, it does not need to be the final presentation, instead it can be the presentation that best works for the author.
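
To make the separation concrete, here is a small Python sketch in which the article carries only semantic elements, and the presentation is supplied as a stylesheet when the HTML is generated. The element names follow the NLM format (article-title, sec, title, p); the function name, the sample article, and the CSS strings are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical article containing semantics only, with no layout information.
ARTICLE = """<article>
  <front><article-meta><title-group>
    <article-title>Structured Content</article-title>
  </title-group></article-meta></front>
  <body>
    <sec><title>Introduction</title><p>Semantics first, layout later.</p></sec>
  </body>
</article>"""


def render_html(article_xml, css):
    """Map semantic elements to HTML; all presentation comes from the css argument."""
    root = ET.fromstring(article_xml)
    parts = [f"<html><head><style>{css}</style></head><body>"]
    title = root.findtext(".//article-title")
    if title:
        parts.append(f"<h1>{escape(title)}</h1>")
    for sec in root.iter("sec"):
        heading = sec.findtext("title")
        if heading:
            parts.append(f"<h2>{escape(heading)}</h2>")
        for para in sec.findall("p"):
            parts.append(f"<p>{escape(''.join(para.itertext()))}</p>")
    parts.append("</body></html>")
    return "\n".join(parts)


if __name__ == "__main__":
    # The same archived XML rendered with two different presentations.
    print(render_html(ARTICLE, "body { font-family: serif; margin: 2em; }"))
    print(render_html(ARTICLE, "body { font-family: sans-serif; color: #333; }"))
```

Nothing in the archived XML changes between the two renderings, which is the point: margins, fonts, and colors become a decision made at viewing time.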

Currently some Word article templates rely on Styles to identify semantics.  This is the best that could be done with the capabilities of previous versions of Word, but it is fragile as it mixes presentation and semantics.  To make matters worse, it is easy for styles to "leak" through copy and paste, or editing, invalidating the semantics.  Word 2007's use of XML as its native format, its ability to have custom XML elements, and the extensibility of Word's user interface and file packaging, enable a more robust way of entering semantic elements during authoring, preserving metadata, and enabling conversion to other formats.

Capturing Semantics and Authors' Insights

Authors will be the largest class of users of the add-in, so we focus a lot on the experience that is presented to this audience.  Authors likely will have no idea of the format used to back their articles (whether OpenXML or NLM), nor should they care.  Also, the richness and complexity of the metadata expressible in the NLM format does not need to be exposed to them in a raw form (but needs to be accessible to the journal/archival staff - I will cover this in a future posting).

What is the benefit to authors from capturing semantics and metadata?  As semantic search evolves, articles with more/better semantic data should become more relevant in search results than articles without this information.  Thus, as content moves to be consumed primarily in digital form, articles with better semantics stand a better chance of being found, read, and cited.

Along the lines of the Dublin Core (or the core of the Core), we are focusing on enabling journals to capture a set of data from authors:

  • Sections (title, abstract, etc)
  • Author information
  • Keywords and subjects
  • Author notes

While it would be interesting to capture additional information within the content during authoring, it is important not to overburden authors.  At least if we manage to get this small set of data reliably, and reduce entry errors, then we will provide a good baseline for metadata in articles.  Over time, as authors become comfortable with the concept, and see benefits, the baseline can be moved up (but again, the user interaction needs to be simple and, as much as possible, unobtrusive).

Additional Reading

Peter Murray-Rust has a couple of threads with related topics, on semantics and chemistry and structured content and PDF.

Posted by pablofe | 1 Comments

The Increasing Relevance of XML in the Overall Content Lifecycle

I had a conversation with Jon Udell about the add-in, and the greater relevance of XML in the publishing lifecycle, which he just posted on the Channel 10 site.

Jon has a great approach to this new series of postings, where he provides different modalities for consuming the content.  Audiences can read a transcript, watch a screencast, or listen to the podcast.  Incidentally, over the last year, after I bought a Zune with its ability to automatically manage podcast subscriptions, I started to listen to podcasts while jogging and find it a very convenient way to catch up on podcasts (both technical and general topics, such as NPR programs) while exercising.  Providing multiple formats, and adapting to how people will consume content in the future, is a trend we need to keep an eye on for evolving the STM publishing space.

On its 10th anniversary, the infrastructure around XML (formats, tools, as well as processes/workflows) is now at a point where we will finally derive greater benefits from this great technology.  XML usage will now start to take place at the point of origin for content, without authors being directly aware of it, and its usage throughout the publishing workflow will be simplified.  The net result should be an improvement in time to publish, relevance in search, and in the presentation of the content (more on this in a future post).

Publishing Workflow – Math content as paths vs glyphs in generated PDF files

Recently I was involved in diagnosing an issue where, when a PDF file was generated from Word 2007, the Math content from the Word document was being converted to paths, instead of being represented by glyphs from the Cambria Math font.

Note that you can download a free add-in to generate PDF files from Word 2007 from here.  Also, Word 2007 has quite a bit of new Math functionality, and a beautiful font to go along with it (Cambria Math).

When the goal is to generate high quality content, whether Math content is represented as paths or glyphs makes a difference.  Note that this is not something that a casual observer would necessarily notice, as seen in these screen shots at 100% magnification.

The first screenshot is from the content in Word.  The second is of the generated PDF file with the content as paths.  The third image is of the generated PDF file with the content as glyphs.  There is very little difference in the three screen shots below (at least to me).

Original content in Word

Path based content in Adobe’s PDF viewer (100% zoom)

Glyph based content in Adobe’s PDF viewer (100% zoom)

However, when zooming in at 600%, it is possible to start noticing that, in the case where paths were used, the curves have discrete line segments, whereas the glyph version continues to be smooth.

Path based content at 600% magnification; note the aliasing on the curved segments.

Glyph based content at 600% magnification, perfect!
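To see why the segments only show up when zooming in, here is a small sketch that flattens a curve into a fixed number of straight segments and measures how far the resulting polyline strays from the true curve. It is purely illustrative: the quadratic Bézier, the control points, and the segment counts are my assumptions, and this is not how Word's PDF export actually builds its paths.

```python
import math


def quadratic_bezier(p0, p1, p2, t):
    """Point on the quadratic Bezier curve defined by p0, p1, p2 at parameter t."""
    u = 1.0 - t
    return (
        u * u * p0[0] + 2 * u * t * p1[0] + t * t * p2[0],
        u * u * p0[1] + 2 * u * t * p1[1] + t * t * p2[1],
    )


def flatten(p0, p1, p2, segments):
    """Approximate the curve with a polyline of `segments` straight pieces."""
    return [quadratic_bezier(p0, p1, p2, i / segments) for i in range(segments + 1)]


def point_segment_distance(p, a, b):
    """Shortest distance from point p to the line segment a-b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:
        return math.hypot(p[0] - a[0], p[1] - a[1])
    t = ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(p[0] - (a[0] + t * dx), p[1] - (a[1] + t * dy))


def max_chord_error(p0, p1, p2, segments, samples=50):
    """Largest sampled gap between the true curve and its flattened polyline."""
    worst = 0.0
    for i in range(segments):
        a = quadratic_bezier(p0, p1, p2, i / segments)
        b = quadratic_bezier(p0, p1, p2, (i + 1) / segments)
        for j in range(1, samples):
            t = (i + j / samples) / segments
            curve_point = quadratic_bezier(p0, p1, p2, t)
            worst = max(worst, point_segment_distance(curve_point, a, b))
    return worst
```

The error is fixed in page units once the curve has been flattened, so on screen it grows with the zoom factor: max_chord_error(...) * 6 is what you are looking at in the 600% view. A glyph, by contrast, is rendered from the font's outline at whatever size is requested, so the curve stays smooth.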

Initially we could not reproduce the problem in our environment over here (let me tell you how much I hate it when we cannot reproduce bugs).  In talking to the folks that reported the problem, we verified that the fonts were correctly installed and the font file versions were as expected.  Folks in the Word team then tracked down under which conditions paths would get generated, and we also found out that the original problem was being seen on a Windows Server 2003 installation, not on a client configuration.  Note that this is not an issue that one would run into with Windows Vista, because it has a different default configuration.

From there it was straightforward to verify and solve the problem.  When exporting content, Word checks whether Complex Scripts or Far East scripts are enabled on the machine, to decide whether to generate paths or glyphs for Math content.  In case you run into a similar issue, the solution was to enable both scripts on the server (which may require the installation disk and a reboot), through the Languages tab in the Regional Settings control panel.
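If you want a quick way to check which case you are in without zooming into every equation, one rough heuristic is to look for the font's name in the generated PDF: glyph based text needs a font entry, while path based content does not. The sketch below scans the raw bytes for /BaseFont and /FontName entries. The function name is hypothetical, the exact name under which Cambria Math appears is an assumption you should verify against a known-good file, and font entries stored inside compressed object streams will not be found this way, so treat a miss as a hint rather than proof (a real PDF library would be more reliable).

```python
import re

# Matches entries such as "/BaseFont /ABCDEF+CambriaMath" in uncompressed PDF objects.
_FONT_NAME = re.compile(rb"/(?:BaseFont|FontName)\s*/([^\s/<>\[\]()]+)")


def embedded_font_names(pdf_bytes: bytes) -> set[str]:
    """Return font names found in plain (uncompressed) font dictionaries."""
    names = set()
    for match in _FONT_NAME.finditer(pdf_bytes):
        name = match.group(1).decode("latin-1")
        # Subset fonts carry a six-letter uppercase tag followed by "+".
        tag = name[:6]
        if len(name) > 7 and name[6] == "+" and tag.isalpha() and tag.isupper():
            name = name[7:]
        names.add(name)
    return names
```

For example, any(n.startswith("Cambria") for n in embedded_font_names(data)) gives a first indication of whether the Math content made it into the file as glyphs.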

It is nice to have a happy ending to problems, and in this case being able to preserve high quality Math content.

Posted by pablofe | 3 Comments

Multi Discipline Relevance

Geoff makes the point that he thinks the NLM DTD is relevant for disciplines outside of Biology and Medicine, such as Humanities and Social Sciences.  I agree with his point and hope that with the add-in we will help authors and journals in those other disciplines, which already tend to use Word, in the submission and conversion processes.  Even going beyond the conversion to the NLM tagset, I think that there are other features in the add-in that will be of use to authors and to the editorial staff.  I plan to cover both of these aspects in future postings.

Covering yet another set of scientific disciplines, last month the folks from ArXiv (the largest repository for Physics, also very popular for Math and Computer Science papers) posted the news that they now accept OpenXML files for submissions.  I am also looking forward to being able to help scientists and researchers that deposit OpenXML files into ArXiv.

One fun aspect of this project is being able to work with the very sharp individuals who are at the center of the ongoing evolution in the Scientific, Technical, and Medical publishing space.  They range from folks in the repository space (from the National Library of Medicine, British Library, and ArXiv), to the staff of commercial publishers and journals, to key folks in many of the companies that develop tools for the publishing workflow.

Posted by pablofe | 1 Comments