Today we will see how to load fixed width files into a Hive database. We use Hive's SerDe properties to load fixed width files.
We will first create a staging table, then load the fixed width data into it. The table is created with SerDe properties that specify the length of each column. You can refer to my post “Create table in Hive“ for the syntax.
Sample fixed length file (sample_file.txt):
EID NAME AGESALARY DEPT
1001Subbayya Sivasankaranarayana Pillai 25 25000.00HR
1002Raj Chandra Bose 27 310000.00FIN
1003Tirukkannapuram Vijayaraghavan 30 44000.00MKT
1004Dattaraya Ramchandra Kaprekar 21 82345.00EDU
1005Samarendra Nath Roy 24 123456.00ADM
1006 Madame Curie 26 723456.00SCI
1007 Rosalind Franklin 23 321456.00SCI
(padding spaces are collapsed in the listing above; in the actual file every row is exactly 57 characters wide, per the layout below)
Here the fields have the following lengths, start positions and end positions:
+--------+--------+----------------+--------------+
| Field  | Length | Start position | End position |
+--------+--------+----------------+--------------+
| EID    |      4 |              1 |            4 |
| NAME   |     35 |              5 |           39 |
| AGE    |      3 |             40 |           42 |
| SALARY |     11 |             43 |           53 |
| DEPT   |      4 |             54 |           57 |
+--------+--------+----------------+--------------+
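The input.regex we will hand to RegexSerDe follows directly from these lengths: one capture group of the form (.{N}) per field. Here is a quick sanity check of that pattern in plain Python, outside Hive; the sample row is padded by hand to the widths above:

```python
import re

# Field widths from the layout table: EID=4, NAME=35, AGE=3, SALARY=11, DEPT=4
widths = [4, 35, 3, 11, 4]
pattern = re.compile("".join("(.{%d})" % w for w in widths))

# A 57-character row padded to the widths above (padding is illustrative)
row = ("1001"
       + "Subbayya Sivasankaranarayana Pillai".ljust(35)
       + "25 "
       + "25000.00".rjust(11)
       + "HR  ")
assert len(row) == sum(widths)

m = pattern.match(row)
fields = [g.strip() for g in m.groups()]
print(fields)
# ['1001', 'Subbayya Sivasankaranarayana Pillai', '25', '25000.00', 'HR']
```

This is exactly how RegexSerDe slices each line: every capture group becomes one column, in order.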
Following is the CREATE TABLE statement for the fixed width file:
CREATE TABLE employees_stg (emplid STRING, name STRING, age INT, salary DOUBLE, dept STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "(.{4})(.{35})(.{3})(.{11})(.{4})",
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s"
)
LOCATION '/path/to/input/employees_stg';
The above query will throw an error:
Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.RuntimeException: MetaException(message:org.apache.hadoop.hive.serde2.SerDeException org.apache.hadoop.hive.contrib.serde2.RegexSerDe only accepts string columns, but column[2] named age has type int)
We should create the table with all columns typed as STRING; any other data type will throw an error like the one above, because RegexSerDe only accepts string columns.
CREATE TABLE employees_stg (emplid STRING, name STRING, age STRING, salary STRING, dept STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "(.{4})(.{35})(.{3})(.{11})(.{4})",   -- length of each column specified between braces "({})"
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s"   -- output in string format
)
LOCATION '/path/to/input/employees_stg';

LOAD DATA INPATH '/path/to/sample_file.txt' INTO TABLE employees_stg;

SELECT * FROM employees_stg;
EID NAME AGE SALARY DEPT
1001 Subbayya Sivasankaranarayana Pillai 25 25000.00 HR
1002 Raj Chandra Bose 27 310000.00 FIN
1003 Tirukkannapuram Vijayaraghavan 30 44000.00 MKT
1004 Dattaraya Ramchandra Kaprekar 21 82345.00 EDU
1005 Samarendra Nath Roy 24 123456.00 ADM
1006 Madame Curie 26 723456.00 SCI
1007 Rosalind Franklin 23 321456.00 SCI
Time taken: 0.063 seconds, Fetched: 8 row(s)
Disadvantages —
1. If you have noticed, when we say “LOAD DATA INPATH …“ the input file is NOT copied; instead it is moved to the location mentioned in “CREATE TABLE … LOCATION“.
2. If your file contains a header row (like the sample file used in this example), the header also gets loaded into the staging table. We cannot skip it here (the first row of the query output above is the file's header).
To skip header and/or footer rows, from Hive 0.13 onwards you can add the properties below to the TBLPROPERTIES clause of the CREATE TABLE DDL. Please refer to the following links for more information —
HIVE-5795
HIVE-10176
Here is the example for it —
CREATE EXTERNAL TABLE tbl_without_header (eid STRING, name STRING, dept STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
LOCATION '/tbl_without_header'
TBLPROPERTIES ("skip.header.line.count"="1", "skip.footer.line.count"="3");
Note — I strongly suggest not using this property in production. Headers/footers should be removed as part of the ETL process before loading data into the Hive database.
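That pre-cleaning step can be sketched in plain Python (the function name and the sample rows here are my own, for illustration only):

```python
def strip_header_footer(lines, header_count=1, footer_count=0):
    """Drop the first header_count and last footer_count lines.

    Mirrors what skip.header.line.count / skip.footer.line.count do inside
    Hive, but done up front in the ETL so the staging file is already clean.
    """
    end = len(lines) - footer_count if footer_count else len(lines)
    return lines[header_count:end]

# Hypothetical file contents: one header row, two data rows, one trailer row
rows = ["EID NAME AGESALARY DEPT", "1001...", "1002...", "TRAILER"]
print(strip_header_footer(rows, header_count=1, footer_count=1))
# ['1001...', '1002...']
```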
Hope you like the post. You can post your comments/suggestions below.
If the data doesn't have a fixed length, then what is the command?
Hi Venkat,
Firstly, this was very useful, thanks.
I have a data row length of 50.
My COL 1 is between positions 3 and 10,
COL 2 is between 7 and 20, etc.
Now how do I write my regex to extract values between those specific column positions, ignoring the unwanted columns in between?
I am using something like this:
"input.regex" = ".{3}(.{10}).{7}(.{20})'.*",
I am getting nulls
Appreciate any help on this
Thanks
Hi. Thank you for your kind words. Since your requirement is picking specific columns out of a fixed width file, I would suggest loading the fixed-width file into a single-column (temp) table and using Hive's substr() to extract the required fields.
In "input.regex" = "(.{3})(.{10})(.{7})(.{20}).*" you are telling the SerDe to capture the first 3 chars, then the next 10 chars, then 7 chars, then the next 20 chars, and so on; each capture group maps to one table column, in order. Also note the stray apostrophe before ".*" in the pattern you posted: RegexSerDe returns NULL for every column whenever the regex fails to match the whole line, which would explain the NULLs you are seeing.
Hope this helps. Let me know if you have resolved it or got a better idea.
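The single-column-plus-substring approach can be sketched in plain Python standing in for Hive (note that Hive's substr(str, pos, len) is 1-based, like the positions below; the sample line and positions are made up for illustration):

```python
def substr(s, pos, length):
    """1-based substring, mimicking Hive's substr(str, pos, len)."""
    return s[pos - 1:pos - 1 + length]

# Hypothetical layout: skip 3 chars, take 10 chars starting at position 4,
# skip 7 more, then take 20 chars starting at position 21.
line = "XXXcol-one-10YYYYYYYcol-two-is-20-chars!ZZZ"
col1 = substr(line, 4, 10).strip()
col2 = substr(line, 21, 20).strip()
print(col1, col2)
# col-one-10 col-two-is-20-chars!
```

In Hive the equivalent would be a SELECT over the single-column staging table with one substr() call per field.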
Hi Venkat,
I want to create an external table on a fixed width file. As you guided above, I am able to create the table, but when I select from it the column values show as NULL.
Sample data:
123raju hyd 123
145satish bng 5679
Queries:
CREATE EXTERNAL TABLE IF NOT EXISTS MY_TABLE (emp_no STRING, name_ STRING, loc STRING, salary STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES ('input.format.string' = 'FL3#FL#10#5FL#4')
LOCATION 'hdfs://bigdatalite.localdomain:8020/user/hdfs/Ganga';

CREATE EXTERNAL TABLE IF NOT EXISTS MY_TABLE (emp_no STRING, name_ STRING, loc STRING, salary STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES ('input.regex' = '(.{3})(.{6})(.{5})(.{4})*')
LOCATION 'hdfs://bigdatalite.localdomain:8020/user/hdfs/Ganga';

CREATE EXTERNAL TABLE IF NOT EXISTS MY_TABLE (emp_no STRING, name_ STRING, loc STRING, salary STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "(.{3})(.{6})(.{5})(.{4})", "output.format.string" = "%1$s %2$s %3$s %4$s")
LOCATION 'hdfs://bigdatalite.localdomain:8020/user/hdfs/Ganga';
Hi Gangadhar,
A fixed width file is self-explanatory: data in all rows must be of fixed length, i.e., if you say 600 chars is the fixed width, then the file must contain exactly 600 chars for each row of data. You should pad with spaces (or some other characters) to fill the length of 600. If this is not fulfilled, there is no point in calling the input file a fixed width file.
Hope you can now see where the issue in your file is.
–Venkat
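That padding requirement can be sketched as a small pre-processing step in plain Python (the widths [3, 6, 5, 4] match the regex in the question above; the helper name is mine, for illustration):

```python
def to_fixed_width(fields, widths):
    """Left-justify each field into its declared width so every row
    comes out the same length (the precondition RegexSerDe needs)."""
    if len(fields) != len(widths):
        raise ValueError("field/width count mismatch")
    # ljust pads short values; the [:w] slice truncates over-long ones
    return "".join(f.ljust(w)[:w] for f, w in zip(fields, widths))

# Hypothetical widths for emp_no, name_, loc, salary
widths = [3, 6, 5, 4]
print(repr(to_fixed_width(["123", "raju", "hyd", "123"], widths)))
# '123raju  hyd  123 '  (every row is exactly sum(widths) = 18 chars)
```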
Best explanation with proper example
+--------+--------+----------------+--------------+
| Field  | Length | Start position | End position |
+--------+--------+----------------+--------------+
| EID    |      4 |              1 |            4 |
| NAME   |     35 |              5 |           39 |
| AGE    |      3 |             40 |           42 |
| SALARY |     11 |             43 |           53 |
| DEPT   |      4 |             54 |           57 |
+--------+--------+----------------+--------------+
What is syntax for this?
Mack – This is the file layout structure; it shows how the data is laid out in the file.
Hey Venkat,
Even I am facing a similar issue of getting NULL values. I have created my table and placed the txt file in the table path, but running a SELECT always gives me NULL. Please suggest.
Hi San – Can you share some sample data (a couple of lines, masked) so that I can replicate the issue and find a solution?
Hi Venkat,
I encounter a problem when using the regex to load data into Hive:
the fixed length file contains multibyte characters, and the length is fixed at the byte level.
For example, two rows of the file would be (both consisting of 9 bytes):
你好123
ABCDEFGHI
How can we load this kind of file?
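For context on why this is tricky: when widths are fixed at the byte level, slicing must happen on the encoded bytes rather than on characters. A pre-processing sketch in plain Python (assuming UTF-8; the helper name is mine, and this is not something RegexSerDe does for you):

```python
# Both rows are 9 bytes in UTF-8, but not the same number of characters.
rows = ["你好123", "ABCDEFGHI"]

def split_by_bytes(line, byte_widths, encoding="utf-8"):
    """Slice a line into fields by byte offsets, then decode each field.

    Each width must land on a character boundary of the encoded text,
    otherwise the per-field decode raises UnicodeDecodeError.
    """
    data = line.encode(encoding)
    fields, pos = [], 0
    for w in byte_widths:
        fields.append(data[pos:pos + w].decode(encoding))
        pos += w
    return fields

print(len(rows[0]), len(rows[0].encode("utf-8")))  # 5 characters, 9 bytes
print(split_by_bytes(rows[0], [6, 3]))             # ['你好', '123']
print(split_by_bytes(rows[1], [3, 6]))             # ['ABC', 'DEFGHI']
```

A character-count regex like (.{6})(.{3}) would split these two rows differently, so converting the file to uniform character widths (or splitting by bytes up front, as above) is one way to make it loadable.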
Venkat
I am entering the query in Hue, and the Hue interface is asking for properties to run the SQL.
I am using Cloudera… do you think you can help me out?
Hi Venkat,
Thank you very much for the post.
I followed your steps exactly and created an external table using the SerDe (org.apache.hadoop.hive.contrib.serde2.RegexSerDe), but somehow all the fields display NULLs in the table. Not sure why; any help is really appreciated.
Thanks,
Narendra
Sure. In my example I have created the staging table with all fields typed as STRING to overcome this issue, because you always read fixed-length data as strings/characters. It is good practice to create a staging table, load your raw/semi-processed data into it, and from there move the data into your main/master table. Once data is loaded into the master table, you can delete/archive the staging table.
Hope this helps.
–Venkat
Hi… Did you find a resolution for the fields displaying NULLs in the table?
Could you please send me the sample data you are working on? Based on the data, I will be able to explain the issue behind it.
Hi
Need your help with one of my use cases; the SerDe approach above is not working for me. Please let me know if you can help.
Will post further details as required.
Sure! Please send me the details and we can discuss.
Regarding the header line: with Hive 0.13 and above you can add TBLPROPERTIES ("skip.header.line.count"="1") to the DDL.
Thanks Bruce for apprising me.
Venkat S