XML parsing in Hive

Working with complex XML data that contains multiple collections in it. From the sample file you find book authors with multiple book titles, genre and other details. We will try to find solution using Hive.

Sample xml file (sample.xml)

1
2
3
4
5
6
7
8
9
<book><author>Gambardella, Matthew</author><title>XML Developer's Guide</title><genre>Computer</genre><price discount="0.15">44.95</price><publish_date>2000-10-01</publish_date><description>An in-depth look at creating applications with XML.</description></book>
<book><author>Corets, Eva</author><title>Maeve Ascendant</title><genre>Fantasy</genre><price discount="0.15">5.95</price><publish_date>2000-11-17</publish_date><description>After the collapse of a nanotechnology society in England, the young survivors lay the foundation for a new society.</description><title>Oberon's Legacy</title><genre>Fantasy</genre><price discount="0.15">5.95</price><publish_date>2001-03-10</publish_date><description>In post-apocalypse England, the mysterious agent known only as Oberon helps to create a new life for the inhabitants of London. Sequel to Maeve Ascendant.</description><title>The Sundered Grail</title><genre>Fantasy</genre><price discount="0.15">5.95</price><publish_date>2001-09-10</publish_date><description>The two daughters of Maeve, half-sisters, battle one another for control of England. Sequel to Oberon's Legacy.</description></book>
<book><author>Ralls, Kim</author><title>Midnight Rain</title><genre>Fantasy</genre><price discount="0.15">5.95</price><publish_date>2000-12-16</publish_date><description>A former architect battles corporate zombies, an evil sorceress, and her own childhood to become queen of the world.</description></book>
<book><author>Randall, Cynthia</author><title>Lover Birds</title><genre>Romance</genre><price discount="0.15">4.95</price><publish_date>2000-09-02</publish_date><description>When Carla meets Paul at an ornithology conference, tempers fly as feathers get ruffled.</description></book>
<book><author>Thurman, Paula</author><title>Splish Splash</title><genre>Romance</genre><price discount="0.15">4.95</price><publish_date>2000-11-02</publish_date><description>A deep sea diver finds true love twenty thousand leagues beneath the sea.</description></book>
<book><author>Knorr, Stefan</author><title>Creepy Crawlies</title><genre>Horror</genre><price discount="0.15">4.95</price><publish_date>2000-12-06</publish_date><description>An anthology of horror stories about roaches,centipedes, scorpionsand other insects.</description></book>
<book><author>Kress, Peter</author><title>Paradox Lost</title><genre>Science Fiction</genre><price discount="0.15">6.95</price><publish_date>2000-11-02</publish_date><description>After an inadvertant trip through a Heisenberg Uncertainty Device, James Salway discovers the problems of being quantum.</description></book>
<book><author>O'Brien, Tim</author><title>Microsoft .NET: The Programming Bible</title><genre>Computer</genre><price discount="0.15">36.95</price><publish_date>2000-12-09</publish_date><description>Microsoft's .NET initiative is explored in detail in this deep programmer's reference.</description><title>MSXML3: A Comprehensive Guide</title><genre>Computer</genre><price discount="0.15">36.95</price><publish_date>2000-12-01</publish_date><description>The Microsoft MSXML3 parser is covered in detail, with attention to XML DOM interfaces, XSLT processing, SAX and more.</description></book>
<book><author>Galos, Mike</author><title>Visual Studio 7: A Comprehensive Guide</title><genre>Computer</genre><price discount="0.15">49.95</price><publish_date>2001-04-16</publish_date><description>Microsoft Visual Studio 7 is explored in depth, looking at how Visual Basic, Visual C++, C#, and ASP+ are integrated into a comprehensive development environment.</description></book>

This xml file contains the following fields –
author, title, genre, price, discount, publish_date, description

Step 1:

Load the xml data into a temporary table. This table is used for further processing.

1
create temporary table if not exists mydb.xml_temp(line string);

Now load xml file data into this temporary table xml_temp

1
load data local inpath 'sample.xml' into table mydb.xml_temp;

Step 2:

Now create another temporary table. This temporary table will extract data from xml field collections. Our xml data fields –
title, genre, price, discount, publish_date, description
are collections of field author.
For each author there are multiple collections. For these multiple collections, create field datatype as array.

1
2
3
4
5
6
7
8
9
10
create temporary table if not exists mydb.books_temp(
       author string,
       title array<string>,
       genre array<string>,
       price array<string>,
       discount array<string>,
       publish_date array<string>,
       description array<string>)
       row format delimited
       fields terminated by '|';

Now load data from table xml_temp into table books_temp and extract all collections. To extract each collection we use XPathUDF of hive – xpath_string, xpath_int, xpath etc.

Please refer the link for more information.

1
2
3
4
5
6
7
8
9
10
11
insert overwrite table mydb.books_temp
select xpath_string(line, 'book/author'),
       xpath(line, 'book/title/text()'),
       xpath(line, 'book/genre/text()'),
       xpath(line, 'book/price/text()'),
       xpath(line, 'book/price/@discount'),
       xpath(line, 'book/publish_date/text()'),
       xpath(line, 'book/description/text()')
from   mydb.xml_temp;

select * from mydb.books_temp;

Output –

1
2
3
4
5
6
7
8
9
Gambardella, Matthew    ["XML Developer's Guide"]   ["Computer"]    ["44.95"]   ["0.15"]    ["2000-10-01"]  ["An in-depth look at creating applications with XML."]
Corets, Eva ["Maeve Ascendant","Oberon's Legacy","The Sundered Grail"]  ["Fantasy","Fantasy","Fantasy"] ["5.95","5.95","5.95"]  ["0.15","0.15","0.15"]  ["2000-11-17","2001-03-10","2001-09-10"]    ["After the collapse of a nanotechnology society in England, the young survivors lay the foundation for a new society.","In post-apocalypse England, the mysterious agent known only as Oberon helps to create a new life for the inhabitants of London. Sequel to Maeve Ascendant.","The two daughters of Maeve, half-sisters, battle one another for control of England. Sequel to Oberon's Legacy."]
Ralls, Kim  ["Midnight Rain"]   ["Fantasy"] ["5.95"]    ["0.15"]    ["2000-12-16"]  ["A former architect battles corporate zombies, an evil sorceress, and her own childhood to become queen of the world."]
Randall, Cynthia    ["Lover Birds"] ["Romance"] ["4.95"]    ["0.15"]    ["2000-09-02"]  ["When Carla meets Paul at an ornithology conference, tempers fly as feathers get ruffled."]
Thurman, Paula  ["Splish Splash"]   ["Romance"] ["4.95"]    ["0.15"]    ["2000-11-02"]  ["A deep sea diver finds true love twenty thousand leagues beneath the sea."]
Knorr, Stefan   ["Creepy Crawlies"] ["Horror"]  ["4.95"]    ["0.15"]    ["2000-12-06"]  ["An anthology of horror stories about roaches,centipedes, scorpionsand other insects."]
Kress, Peter    ["Paradox Lost"]    ["Science Fiction"] ["6.95"]    ["0.15"]    ["2000-11-02"]  ["After an inadvertant trip through a Heisenberg Uncertainty Device, James Salway discovers the problems of being quantum."]
O'Brien, Tim    ["Microsoft .NET: The Programming Bible","MSXML3: A Comprehensive Guide"]   ["Computer","Computer"] ["36.95","36.95"]   ["0.15","0.15"] ["2000-12-09","2000-12-01"] ["Microsoft's .NET initiative is explored in detail in this deep programmer's reference.","The Microsoft MSXML3 parser is covered in detail, with attention to XML DOM interfaces, XSLT processing, SAX and more."]
Galos, Mike ["Visual Studio 7: A Comprehensive Guide"]  ["Computer"]    ["49.95"]   ["0.15"]    ["2001-04-16"]  ["Microsoft Visual Studio 7 is explored in depth, looking at how Visual Basic, Visual C++, C#, and ASP+ are integrated into a comprehensive development environment."]

If you see, for some authors there are multiple prices, discounts, titles, genre, descriptions as array. This data needs to be arranged in structured format.

Step 3:

To get structured data where multiple columns are arrays, we need to go with UDTF (User Defined Table Function).

Use “explode()” to split this array data from columnar format to row format,
i.e., from horizontal format
["Maeve Ascendant","Oberon’s Legacy","The Sundered Grail"]
to vertical form –
Maeve Ascendant
Oberon’s Legacy
The Sundered Grail

How to use UDTF explode()?

Invalid use of explode() in SQL SELECT

1
select author, explode(price) as price from mydb.books_temp;
1
select explode(price) as price, explode(discount) as discount from mydb.books_temp;

Note –

  • UDTF explode() with more than 1 different column expressions CANNOT be given in SELECT statement
  • Along with UDTFs other column expressions are NOT valid in SELECT statements

Valid use of explode() in SQL SELECT –

1
select explode(price) as price from mydb.books_temp;

But there are situations where we need other column expression along with UDTF columns which gives meaning to the data.

Solution to this scenario is using “lateral view” along with UDTF “explode()” –

1
2
3
4
select  author, title, genre, publish_date, description, bprice, bdiscount
from    mydb.books_temp
lateral view explode(price) p as bprice
lateral view explode(discount) d as bdiscount;

But, this has some issues. If you see output data of the above SELECT, it is Cartesian Joined (Cross Join) for all the prices and discounts.

Solution –
When data has multiple columns with array data types, we need to split each column into different tables.

Since our data has only 5 columns with data type as array, create the following temporary tables with additional field “seqno“. This seqno is used for 1 to 1 join

Keep all data related to –
    title into books_temp_title
    genre into books_temp_genre
    price into books_temp_price
    discount into books_temp_discount
    publish_date into books_temp_publish_date
    description into books_temp_description

Step 4:

Following temporary tables are created for the above mentioned column fields –

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
create temporary table if not exists mydb.books_temp_title(
       seqno int,
       author string,
       title string)
       row format delimited
       fields terminated by '|';

create temporary table if not exists mydb.books_temp_genre(
       seqno int,
       author string,
       genre  string)
       row format delimited
       fields terminated by '|';

create temporary table if not exists mydb.books_temp_price(
       seqno int,
       author string,
       price string)
       row format delimited
       fields terminated by '|';

create temporary table if not exists mydb.books_temp_discount(
       seqno int,
       author string,
       discount string)
       row format delimited
       fields terminated by '|';

create temporary table if not exists mydb.books_temp_publish_date(
       seqno int,
       author string,
       publish_date string)
       row format delimited
       fields terminated by '|';

create temporary table if not exists mydb.books_temp_description(
       seqno int,
       author string,
       description string)
       row format delimited
       fields terminated by '|';

Step 5:

Now load data into the temporary tables created in Step 4 –

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
insert  overwrite table mydb.books_temp_title
select  row_number() over(), author, btitle
from    mydb.books_temp
lateral view explode(title) t as btitle;

insert  overwrite table mydb.books_temp_genre
select  row_number() over(), author, bgenre
from    mydb.books_temp
lateral view explode(genre) g as bgenre;

insert  overwrite table mydb.books_temp_price
select  row_number() over(), author, bprice
from    mydb.books_temp
lateral view explode(price) p as bprice;

insert  overwrite table mydb.books_temp_discount
select  row_number() over(), author, bdiscount
from    mydb.books_temp
lateral view explode(discount) d as bdiscount;

insert  overwrite table mydb.books_temp_publish_date
select  row_number() over(), author, bpublish_date
from    mydb.books_temp
lateral view explode(publish_date) pb as bpublish_date;

insert  overwrite table mydb.books_temp_description
select  row_number() over(), author, bdescription
from    mydb.books_temp
lateral view explode(description) d as bdescription;

— Validate data load

1
2
3
4
5
6
select * from mydb.books_temp_title;
select * from mydb.books_temp_genre;
select * from mydb.books_temp_price;
select * from mydb.books_temp_discount;
select * from mydb.books_temp_publish_date;
select * from mydb.books_temp_description;

Step 6:

Now create master table to store the structured data from the above temporary tables that can be used for further aggregation / operations.

1
2
3
4
5
6
7
8
9
10
11
create table if not exists mydb.books(
       seqno int,
       author string,
       title string,
       genre string,
       price double,
       discount double,
       publish_date date,
       description string)
       row format delimited
       fields terminated by '|';

Load data from table books_temp_* to master table books.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
insert overwrite table mydb.books
select a.seqno, a.author, a.title, b.genre,
       cast(c.price as double),
       cast(d.discount as double),
       to_date(e.publish_date),
       f.description
from   mydb.books_temp_title a,
       mydb.books_temp_genre b,
       mydb.books_temp_price c,
       mydb.books_temp_discount d,
       mydb.books_temp_publish_date e,
       mydb.books_temp_description f
where  a.seqno = b.seqno
and    a.author = b.author
and    a.seqno = c.seqno
and    a.author = c.author
and    a.seqno = d.seqno
and    a.author = d.author
and    a.seqno = e.seqno
and    a.author = e.author
and    a.seqno = f.seqno
and    a.author = f.author;

After loading data into the master table, verify the data

1
select * from mydb.books;

Once xml file data is loaded into master table, you need to drop the temporary tables. Otherwise this will kill the space unnecessarily

1
2
3
4
5
6
7
8
drop table mydb.xml_temp;
drop table mydb.books_temp;
drop table mydb.books_temp_title;
drop table mydb.books_temp_genre;
drop table mydb.books_temp_price;
drop table mydb.books_temp_discount;
drop table mydb.books_temp_publish_date;
drop table mydb.books_temp_description;

Hope you like the post. You can post your comments/suggestions below.

Leave a Reply

Your email address will not be published. Required fields are marked *