XML Processing with Hive XML SerDe

Hive XML SerDe is an XML processing library built on Hive's SerDe (serializer/deserializer) framework. It relies on XmlInputFormat from the Apache Mahout project to shred the input file into XML fragments delimited by specific start and end tags. You can find more about XmlInputFormat in “Hadoop in Practice”. The XML SerDe then queries each XML fragment with an XPath processor to populate the columns of a Hive table. You can find the inner workings of this library here. In this post, I will go over an example of XML processing in Hive using the XML SerDe library.

In our example, we will use the eBay data from the University of Washington’s XML Data Repository site. Download the ebay.xml file found here, then extract it and store it in a folder of your choice.

Example

    • Download the latest version of hivexmlserde.jar from here and copy it to the /lib folder of your Hive installation.
    • In our example, each XML fragment in the ebay.xml file is delimited by <listing> and </listing> as the start and end tags, respectively. Let’s create the ebay_listing Hive table by executing the following CREATE TABLE Hive statement:
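A minimal sketch of such a statement, not a definitive table definition. It assumes the com.ibm.spss.hive.serde2.xml classes shipped in hivexmlserde.jar, <listing> and </listing> as the fragment delimiters, and only the five columns queried later in this post. The XPath expressions are assumptions about the element layout of ebay.xml and should be checked against the file before use:

```sql
-- Sketch only: the column list and every XPath below are assumptions
-- inferred from the columns queried later in this post; verify them
-- against the actual element names in ebay.xml.
CREATE TABLE ebay_listing (
  seller_name STRING,
  bidder_name STRING,
  location    STRING,
  bid_history MAP<STRING,STRING>,
  item_info   MAP<STRING,STRING>
)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
  "column.xpath.seller_name"="/listing/seller_info/seller_name/text()",
  "column.xpath.bidder_name"="/listing/auction_info/high_bidder/bidder_name/text()",
  "column.xpath.location"="/listing/auction_info/location/text()",
  "column.xpath.bid_history"="/listing/bid_history/*",
  "column.xpath.item_info"="/listing/item_info/*"
)
STORED AS
  INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
TBLPROPERTIES (
  "xmlinput.start"="<listing>",
  "xmlinput.end"="</listing>"
);
```

Each "column.xpath.<column>" property tells the SerDe which XPath expression fills that column, while "xmlinput.start" and "xmlinput.end" tell XmlInputFormat where one fragment begins and ends. Declaring bid_history and item_info as maps is one way to make child elements addressable by name, as in bid_history["highest_bid_amount"].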
    • If the table is created successfully, load the previously downloaded ebay.xml file into it by executing the following command (note that ebay.xml is located in the C:/data/ directory in my example, so change the path to match your location):
LOAD DATA LOCAL INPATH 'C:/data/ebay.xml'
OVERWRITE INTO TABLE ebay_listing;
    • Once the data is loaded successfully, you can query the data.
SELECT seller_name, bidder_name, location, bid_history["highest_bid_amount"], item_info["cpu"]
FROM ebay_listing LIMIT 1;

Comments

  • Though the library is relatively easy to use, the table definition takes some getting used to.
  • I haven’t checked the performance against a large XML file yet. I will update this post once I have the performance numbers.