Carl Nolan’s ramblings on development
When I put together the “Generics based Framework for .Net Hadoop MapReduce Job Submission” code one of the goals was to support XML file processing. This was achieved by the creation of a modified Mahout document reader where one can specify the XML node to be presented for processing. But what if ones wants to process XML documents in Hive. Fortunately Hive similarly supports document readers, thus enabling the same document readers to be used as the basis of table definitions.
The process of enabling XML processing in Hive is relatively straightforward:
The syntax for the xpath processing in Hive can be found at: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+XPathUDF
So onto a simple example.
In “SampleScripts” folder of the MapReduce Framework download there is a script that extracts Store information, in XML format, from the sample AdventureWorks database. A sample of the output is as follows:
The record reader to be used will be “XmlElementStreamingInputFormat”. This document reader using a configuration element to define the XML node to be located, which then outputs for each row a single column consisting of the complete node contents.
Using this record reader a table can be defined consisting of a single XML column:
add JARS file:///C:/Users/Carl/Projects/MSDN.Hadoop.MapReduce/Release/msdn.hadoop.readers.jar; set xmlinput.element=Store;
CREATE EXTERNAL TABLE StoresXml (storexml string) STORED AS INPUTFORMAT 'msdn.hadoop.mapreduce.input.XmlElementStreamingInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION '/user/Carl/stores/demographics';
The INPUTFORMAT option allows for the definition of the required document reader. The OUTPUTFORMAT specified is the default for Hive. In this example I have defined an EXTERNAL table over the directory containing the extracted XML; independently copied to the Hadoop cluster.
The ADD JARS statement ensures the document reader is available for job execution. The SET statement configures the job such that the document reader knows what XML node to extract.
Once this table is defined you can use it like any other Hive table. If one selects from this table you will get each Store element as a row. However, xpath processing allows you to extract the XML attributes as native types.
In addition to SELECT operations one also has the option of creating a VIEW that parses the XML and presents the data using native types:
CREATE VIEW Stores(BusinessEntityID, BusinessType, BankName, AnnualSales, AnnualRevenue) AS SELECT xpath_int (storexml, '/Store/BusinessEntityID'), xpath_string (storexml, '/Store/Demographics/*[local-name()=\'StoreSurvey\']/*[local-name()=\'BusinessType\']'), xpath_string (storexml, '/Store/Demographics/*[local-name()=\'StoreSurvey\']/*[local-name()=\'BankName\']'), xpath_double (storexml, '/Store/Demographics/*[local-name()=\'StoreSurvey\']/*[local-name()=\'AnnualSales\']'), xpath_double (storexml, '/Store/Demographics/*[local-name()=\'StoreSurvey\']/*[local-name()=\'AnnualRevenue\']') FROM StoresXml;
Using the Stores definition one can now process the XML data files through the normal Hive operations. Continuing with the same samples in the download one can now easily generate a revenue summary across the banks:
SELECT BusinessType, BankName, CAST(SUM(AnnualSales) AS INT) AS TotalSales FROM Stores GROUP BY BusinessType, BankName;
As expected, under the covers the necessary MapReduce jobs are executed to aggregate the data, returning:
BM Guardian Bank 43200000 BM International Bank 43200000 BM International Security 43200000 BM Primary Bank & Reserve 42800000 BM Primary International 42200000 BM Reserve Security 42200000 BM United Security 42200000 BS Guardian Bank 88400000 BS International Bank 87400000 BS International Security 88400000 BS Primary Bank & Reserve 88400000 BS Primary International 87400000 BS Reserve Security 87400000 BS United Security 87400000 OS Guardian Bank 192000000 OS International Bank 186000000 OS International Security 186000000 OS Primary Bank & Reserve 186000000 OS Primary International 186000000 OS Reserve Security 186000000 OS United Security 186000000
If you download the aforementioned code, the sample for the Hive execution can be found in the “SampleScripts” folder. The “DocumentInputReaders” folder also contains the XML document reader classes along with a usable JAR file.
I have a question about this class org.apache.mahout.classifier.bayes.XmlElementStreamingInputFormat
Where did it come from? I can't see it anywhere in the Mahout source distribution.
Is this a class that you wrote? If so, it would really be better if you didn't use the org.apache.mahout package, if only for the sake of your readers.
Feel free to email me if you have questions at tdunning at apache.org or on the Mahout dev mailing list.
I hits error when execute below statement:
CREATE EXTERNAL TABLE StoresXml (storexml string)
STORED AS INPUTFORMAT 'org.apache.mahout.classifier.bayes.XmlElementStreamingInputFormat'
ementStreamingInputFormat : Unsupported major.minor version 51.0
at java.lang.ClassLoader.defineClass1(Native Method)
What wrong? How to fix?
The error relating to java.lang.UnsupportedClassVersionError: org/apache/mahout/classifier/bayes/XmlElementStreamingInputFormat : Unsupported major.minor version 51.0 could be down to the current code being compiled with the Oracle SDK. Are you running this on Azure. If so you may need to recompile the Java classes. There is a script to do this, just point it to the correct javac exe.
Is there a way to get the "xmlinput.element" property set in the TBLPROPERTIES during the CREATE? Something similar to how the mahout XmlInputFormat does with "xmlinput.start" and "xmlinput.end"?
What happens if there is a text node that is inside "store" but untagged? How would you make this into a field?