XML Processing with Hive XML SerDe

Hive XML SerDe is an XML processing library built on the Hive SerDe (serializer/deserializer) framework. It relies on XmlInputFormat from the Apache Mahout project to shred the input file into XML fragments delimited by specified start and end tags. You can find more about XmlInputFormat in “Hadoop in Practice”. The XML SerDe then queries each XML fragment with an XPath processor to populate the Hive table. You can find the inner workings of this library here. In this post, I will go over an example of XML processing in Hive using the XML SerDe library.
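
To make the two stages concrete before we get to Hive, here is a minimal Python sketch of the idea: first cut the input into fragments between a start tag and an end tag, then evaluate one path expression per column against each fragment. This is only an illustration of the concept and not how the library is implemented. The function names (`shred`, `to_row`), the sample document, and the column paths are all made up for this example, and Python's `xml.etree.ElementTree` supports only a limited subset of XPath.

```python
import xml.etree.ElementTree as ET


def shred(text, start_tag, end_tag):
    """Yield each substring that begins with start_tag and ends with end_tag.

    Hypothetical helper: mimics the idea of tag-based splitting on a string.
    """
    pos = 0
    while True:
        start = text.find(start_tag, pos)
        if start == -1:
            return
        end = text.find(end_tag, start)
        if end == -1:
            return
        end += len(end_tag)
        yield text[start:end]
        pos = end


def to_row(fragment, column_paths):
    """Evaluate one path per column against a fragment and return a row dict.

    Hypothetical helper: stands in for the XPath step that fills table columns.
    """
    root = ET.fromstring(fragment)
    return {col: root.findtext(path) for col, path in column_paths.items()}


# Made-up sample input; any real file would be far larger.
SAMPLE = """<records>
  <record><id>1</id><name>Alice</name></record>
  <record><id>2</id><name>Bob</name></record>
</records>"""

# Made-up mapping of column name to a path relative to the fragment root.
COLUMN_PATHS = {"id": "id", "name": "name"}

rows = [to_row(f, COLUMN_PATHS) for f in shred(SAMPLE, "<record>", "</record>")]
print(rows)
```

Each `<record>` fragment becomes one row, and each entry in `COLUMN_PATHS` becomes one column of that row. Because the splitting is plain string matching on the tags, the start tag has to be chosen so that it does not accidentally match some other element.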