In a previous post I talked about Hadoop Binary Streaming for the processing of Microsoft Office Word documents. However, due to their popularity, I thought adding support for Adobe PDF documents would be beneficial. To this end I have updated the source code to support processing of both “.docx” and “.pdf” documents.
To support reading PDFs I have used the open source library provided by iText (http://itextpdf.com/). iText is a library that allows you to read, create and manipulate PDF documents (http://itextpdf.com/download.php). The original code was written in Java, but a port for .NET is also available (http://sourceforge.net/projects/itextsharp/files/).
Of these libraries I only use the PdfReader class, from the Core library. This class exposes the page count directly and the Author through its Info property.
To use the library in Hadoop one just has to add a -file option for the iTextSharp core library, so that the DLL is shipped with the job:
-file "C:\Reference Assemblies\itextsharp.dll"
This assumes the downloaded and extracted DLL has been copied to and referenced from the “Reference Assemblies” folder.
Supporting PDF documents required only two changes to the code.
Firstly, a new Mapper was defined that processes a PdfReader type and returns the author and page count for the document.
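As a rough illustration of what such a mapper does, here is a minimal sketch in Java (the language iText was originally written in). It is not the code from this post: PdfSource is a hypothetical stand-in for the two values read from PdfReader (the page count and the Info entries), and the "unknown" fallback for a missing Author is my own assumption.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

/** Hypothetical stand-in for the two things read from a PdfReader. */
interface PdfSource {
    int pageCount();                // number of pages in the document
    Map<String, String> info();     // document info entries, e.g. "Author"
}

final class PdfMapper {
    // Assumed fallback key for documents without an Author entry.
    static final String UNKNOWN_AUTHOR = "unknown";

    /** Emits a single (author, page count) pair for one document. */
    static Map.Entry<String, Integer> map(PdfSource pdf) {
        String author = pdf.info().get("Author");
        if (author == null || author.trim().isEmpty()) {
            author = UNKNOWN_AUTHOR;
        }
        return new SimpleEntry<>(author.trim(), pdf.pageCount());
    }
}
```

Keeping the mapper behind a small interface like this also makes it easy to exercise without a real PDF file; a thin adapter around the actual reader would supply the two values.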
Secondly, one has to call the correct mapper based on the document type, namely the file extension.
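The dispatch itself can be very small. The following is a sketch in Java under my own assumptions rather than the code from this post: the MapperDispatch class, the Kind enum and the decision to treat any other extension as unsupported are all hypothetical names and choices made for illustration.

```java
import java.util.Locale;

final class MapperDispatch {
    /** Which mapper should handle a given file. */
    enum Kind { WORD, PDF, UNSUPPORTED }

    /** Picks the mapper kind from the file extension, ignoring case. */
    static Kind kindOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0) {
            return Kind.UNSUPPORTED;    // no extension at all
        }
        String ext = fileName.substring(dot).toLowerCase(Locale.ROOT);
        switch (ext) {
            case ".docx": return Kind.WORD;
            case ".pdf":  return Kind.PDF;
            default:      return Kind.UNSUPPORTED;
        }
    }
}
```

Comparing the extension case-insensitively means files named “Report.PDF” and “report.pdf” end up at the same mapper.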
And that is it.
For Microsoft Word documents, if one needs to process the actual text of a document, this is relatively straightforward:
document.MainDocumentPart.Document.Body.InnerText
Using iText, the text extraction code is a little more complex but still relatively easy. An example can be found here:
http://itextpdf.com/examples/iia.php?id=275
Enjoy!