Hive and XML File Processing

When I put together the “Generics based Framework for .Net Hadoop MapReduce Job Submission” code, one of the goals was to support XML file processing. This was achieved by creating a modified Mahout document reader in which one can specify the XML node to be presented for processing. But what if one wants to process XML documents in Hive? Fortunately, Hive also supports document readers, enabling the same readers to be used as the basis of table definitions.

The process of enabling XML processing in Hive is relatively straightforward:

  • Create the table definition, specifying the XML input format; this exposes each XML node as a single string column
  • Parse the XML column data using XPath expressions in SELECT statements
  • or define a view over the XML table that parses out the relevant XML elements and returns them as native types

The syntax for the XPath UDFs in Hive can be found at: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+XPathUDF
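
As a quick illustration, these UDFs take an XML string and an XPath expression and return the matched value as a native type. A minimal sketch (not part of the original sample scripts; note that older Hive versions require a FROM clause against an existing table rather than a bare SELECT):

SELECT
    xpath_string('<store><name>Next-Door Bike Store</name></store>', 'store/name'),   -- 'Next-Door Bike Store'
    xpath_int('<store><employees>13</employees></store>', 'store/employees'),         -- 13
    xpath_double('<store><sales>800000</sales></store>', 'store/sales');              -- 800000.0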

So, on to a simple example.

In the “SampleScripts” folder of the MapReduce Framework download there is a script that extracts Store information, in XML format, from the sample AdventureWorks database. A sample of the output is as follows:

<Root>
  <Store>
    <BusinessEntityID>292</BusinessEntityID>
    <Name>Next-Door Bike Store</Name>
    <SalesPersonID>279</SalesPersonID>
    <Demographics>
      <StoreSurvey xmlns="http://schemas.microsoft.com/sqlserver/2004/07/adventure-works/StoreSurvey">
        <AnnualSales>800000</AnnualSales>
        <AnnualRevenue>80000</AnnualRevenue>
        <BankName>United Security</BankName>
        <BusinessType>BM</BusinessType>
        <YearOpened>1996</YearOpened>
        <Specialty>Mountain</Specialty>
        <SquareFeet>21000</SquareFeet>
        <Brands>2</Brands>
        <Internet>ISDN</Internet>
        <NumberEmployees>13</NumberEmployees>
      </StoreSurvey>
    </Demographics>
    <Modified>2008-10-13T11:15:07.497</Modified>
  </Store>
  ...
  <Store>
    <BusinessEntityID>374</BusinessEntityID>
    <Name>Immense Manufacturing Company</Name>
    <SalesPersonID>277</SalesPersonID>
    <Demographics>
      <StoreSurvey xmlns="http://schemas.microsoft.com/sqlserver/2004/07/adventure-works/StoreSurvey">
        <AnnualSales>3000000</AnnualSales>
        <AnnualRevenue>300000</AnnualRevenue>
        <BankName>Guardian Bank</BankName>
        <BusinessType>OS</BusinessType>
        <YearOpened>1998</YearOpened>
        <Specialty>Touring</Specialty>
        <SquareFeet>76000</SquareFeet>
        <Brands>4+</Brands>
        <Internet>DSL</Internet>
        <NumberEmployees>73</NumberEmployees>
      </StoreSurvey>
    </Demographics>
    <Modified>2008-10-13T11:15:07.497</Modified>
  </Store>
</Root>

The record reader to be used is “XmlElementStreamingInputFormat”. This document reader uses a configuration element to define the XML node to be located, and for each row outputs a single column containing the complete node contents.

Using this record reader a table can be defined consisting of a single XML column:

add JARS file:///C:/Users/Carl/Projects/MSDN.Hadoop.MapReduce/Release/msdn.hadoop.readers.jar;
set xmlinput.element=Store;

CREATE EXTERNAL TABLE StoresXml (storexml string)
STORED AS INPUTFORMAT 'msdn.hadoop.mapreduce.input.XmlElementStreamingInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/user/Carl/stores/demographics';

The INPUTFORMAT option allows for the definition of the required document reader. The OUTPUTFORMAT specified is the default for Hive. In this example I have defined an EXTERNAL table over the directory containing the extracted XML, which was independently copied to the Hadoop cluster.

The ADD JARS statement ensures the document reader is available for job execution. The SET statement configures the job so that the document reader knows which XML node to extract.

Once this table is defined you can use it like any other Hive table. A SELECT against the table returns each Store element as a single row; XPath processing then allows you to extract the XML element values as native types.
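
For example, a quick check against the StoresXml table defined above (a minimal sketch; the LIMIT simply keeps the output small):

SELECT
    xpath_int(storexml, '/Store/BusinessEntityID') AS BusinessEntityID,
    xpath_string(storexml, '/Store/Name') AS Name
FROM StoresXml
LIMIT 5;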

In addition to SELECT operations one also has the option of creating a VIEW that parses the XML and presents the data using native types:

CREATE VIEW Stores(BusinessEntityID, BusinessType, BankName, AnnualSales, AnnualRevenue) AS
    SELECT
        xpath_int (storexml, '/Store/BusinessEntityID'),
        xpath_string (storexml, '/Store/Demographics/*[local-name()=\'StoreSurvey\']/*[local-name()=\'BusinessType\']'),
        xpath_string (storexml, '/Store/Demographics/*[local-name()=\'StoreSurvey\']/*[local-name()=\'BankName\']'),
        xpath_double (storexml, '/Store/Demographics/*[local-name()=\'StoreSurvey\']/*[local-name()=\'AnnualSales\']'),
        xpath_double (storexml, '/Store/Demographics/*[local-name()=\'StoreSurvey\']/*[local-name()=\'AnnualRevenue\']')
    FROM StoresXml;

Using the Stores view one can now process the XML data files with normal Hive operations. Continuing with the samples in the download, one can easily generate a sales summary across the banks:

SELECT BusinessType, BankName, CAST(SUM(AnnualSales) AS INT) AS TotalSales FROM Stores
GROUP BY BusinessType, BankName;

As expected, under the covers the necessary MapReduce jobs are executed to aggregate the data, returning:

BM      Guardian Bank   43200000
BM      International Bank      43200000
BM      International Security  43200000
BM      Primary Bank & Reserve  42800000
BM      Primary International   42200000
BM      Reserve Security        42200000
BM      United Security 42200000
BS      Guardian Bank   88400000
BS      International Bank      87400000
BS      International Security  88400000
BS      Primary Bank & Reserve  88400000
BS      Primary International   87400000
BS      Reserve Security        87400000
BS      United Security 87400000
OS      Guardian Bank   192000000
OS      International Bank      186000000
OS      International Security  186000000
OS      Primary Bank & Reserve  186000000
OS      Primary International   186000000
OS      Reserve Security        186000000
OS      United Security 186000000
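
Any other standard HiveQL constructs can be applied over the view in the same fashion. As a further hypothetical example (not part of the original samples), ranking the business types by average annual revenue:

SELECT BusinessType, CAST(AVG(AnnualRevenue) AS INT) AS AvgRevenue
FROM Stores
GROUP BY BusinessType
ORDER BY AvgRevenue DESC;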

If you download the aforementioned code, the sample Hive script can be found in the “SampleScripts” folder. The “DocumentInputReaders” folder also contains the XML document reader classes along with a usable JAR file.

Comments
  • Carl,

    I have a question about this class org.apache.mahout.classifier.bayes.XmlElementStreamingInputFormat

    Where did it come from?  I can't see it anywhere in the Mahout source distribution.

    Is this a class that you wrote?  If so, it would really be better if you didn't use the org.apache.mahout package, if only for the sake of your readers.

    Feel free to email me if you have questions at tdunning at apache.org or on the Mahout dev mailing list.

  • I hit an error when executing the statement below:

    CREATE EXTERNAL TABLE StoresXml (storexml string)
    STORED AS INPUTFORMAT 'org.apache.mahout.classifier.bayes.XmlElementStreamingInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION '/stores/demographics';

    Error:

    java.lang.UnsupportedClassVersionError: org/apache/mahout/classifier/bayes/XmlElementStreamingInputFormat : Unsupported major.minor version 51.0
           at java.lang.ClassLoader.defineClass1(Native Method)
           at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
           at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
           at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)

    What is wrong, and how do I fix it?

  • The error relating to java.lang.UnsupportedClassVersionError: org/apache/mahout/classifier/bayes/XmlElementStreamingInputFormat : Unsupported major.minor version 51.0 could be down to the current code being compiled with the Oracle JDK. Are you running this on Azure? If so, you may need to recompile the Java classes. There is a script to do this; just point it to the correct javac executable.

  • Is there a way to set the "xmlinput.element" property in the TBLPROPERTIES during the CREATE, similar to how the Mahout XmlInputFormat does with "xmlinput.start" and "xmlinput.end"?
