Fixed width files in Hive

Today we will see how to load fixed-width files into Hive. We will use Hive's SerDe properties (specifically the contrib RegexSerDe) to load fixed-width files.

We will first create a staging table before loading the fixed-width data into the main table. We create the table using SerDe properties, specifying the length of each column. You can refer to my post Create table in Hive for the syntax.

Sample fixed length file (sample_file.txt):

EID NAME                               AGESALARY     DEPT
1001Subbayya Sivasankaranarayana Pillai 25   25000.00HR  
1002Raj Chandra Bose                    27  310000.00FIN 
1003Tirukkannapuram Vijayaraghavan      30   44000.00MKT 
1004Dattaraya Ramchandra Kaprekar       21   82345.00EDU 
1005Samarendra Nath Roy                 24  123456.00ADM 
1006 Madame Curie                       26  723456.00SCI 
1007 Rosalind Franklin                  23  321456.00SCI 

Here we have fields with the following length, start position and end position.

+---------------------------------------+
|        |        |Start     |End       |
|Field   |Length  |position  |position  |
+---------------------------------------+
|EID     |4       |1         |4         |
|NAME    |35      |5         |39        |
|AGE     |3       |40        |42        |
|SALARY  |11      |43        |53        |
|DEPT    |4       |54        |57        |
+---------------------------------------+
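The "input.regex" we will hand to the SerDe simply captures consecutive fixed-length slices of each row, one capture group per column. Outside Hive, the same pattern can be sanity-checked with any regex engine; here is a quick Python sketch (illustrative only, not part of the Hive setup):

```python
import re

# Same pattern given to the RegexSerDe below: one capture group per column,
# widths 4 (EID), 35 (NAME), 3 (AGE), 11 (SALARY), 4 (DEPT).
FIXED_WIDTH = re.compile(r"(.{4})(.{35})(.{3})(.{11})(.{4})")

# First data row of sample_file.txt, built field by field to keep widths exact
row = ("1001"
       + "Subbayya Sivasankaranarayana Pillai"
       + " 25" + "   25000.00" + "HR  ")

m = FIXED_WIDTH.match(row)
fields = [g.strip() for g in m.groups()]
print(fields)
# ['1001', 'Subbayya Sivasankaranarayana Pillai', '25', '25000.00', 'HR']
```

Note that the pattern must account for every character up to the last column you want; the groups come back exactly as padded in the file, which is why they are stripped here.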

Following is the CREATE TABLE statement for the fixed-width file:

CREATE TABLE employees_stg (emplid STRING, name STRING, age INT, salary DOUBLE, dept STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
     "input.regex" = "(.{4})(.{35})(.{3})(.{11})(.{4})", 
     "output.format.string" = "%1$s %2$s %3$s %4$s %5$s"
     )
LOCATION '/path/to/input/employees_stg';

The above query will throw the following error:

Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.RuntimeException: MetaException(message:org.apache.hadoop.hive.serde2.SerDeException org.apache.hadoop.hive.contrib.serde2.RegexSerDe only accepts string columns, but column[2] named age has type int)

We should create the table with all columns of data type STRING; any other data type will throw an error like the one above.

CREATE TABLE employees_stg (emplid STRING, name STRING, age STRING, salary STRING, dept STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
     "input.regex" = "(.{4})(.{35})(.{3})(.{11})(.{4})",     --Length of each column specified between braces "({})"
     "output.format.string" = "%1$s %2$s %3$s %4$s %5$s"     --Output in string format
     )
LOCATION '/path/to/input/employees_stg';

LOAD DATA INPATH '/path/to/sample_file.txt' INTO TABLE employees_stg;

SELECT * FROM employees_stg;
EID     NAME                                    AGE     SALARY          DEPT
1001    Subbayya Sivasankaranarayana Pillai      25        25000.00     HR  
1002    Raj Chandra Bose                         27       310000.00     FIN 
1003    Tirukkannapuram Vijayaraghavan           30        44000.00     MKT 
1004    Dattaraya Ramchandra Kaprekar            21        82345.00     EDU 
1005    Samarendra Nath Roy                      24       123456.00     ADM 
1006    Madame Curie                             26       723456.00     SCI 
1007    Rosalind Franklin                        23       321456.00     SCI 
Time taken: 0.063 seconds, Fetched: 8 row(s)
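Note that every column comes back as a padded string. When moving rows from the staging table into a typed table you would use CAST(... AS INT) / CAST(... AS DOUBLE) in Hive; conceptually that is the same trim-and-parse step as this small Python sketch (illustrative only):

```python
# The staging table hands back padded strings such as these:
raw_age, raw_salary = " 25", "   25000.00"

# Trim the fixed-width padding, then parse into proper numeric types.
age = int(raw_age.strip())
salary = float(raw_salary.strip())
print(age, salary)  # 25 25000.0
```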

Disadvantages —
1. If you have noticed, when we say LOAD DATA INPATH … the input file is NOT copied; it is moved to the location mentioned in CREATE TABLE … LOCATION.
2. If your file contains a header row (like the sample file used in this example), the header also gets loaded into the staging table. The SerDe cannot skip it (notice the header row in the query output above, which is why 8 rows were fetched).

In order to skip header and/or footer rows, from Hive 0.13 onwards you can add skip properties to the TBLPROPERTIES clause of the CREATE TABLE DDL. Please refer to the following links for more information —
HIVE-5795
HIVE-10176

Here is the example for it —

CREATE EXTERNAL TABLE tbl_without_header (eid STRING, name STRING, dept STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
LOCATION '/tbl_without_header'
TBLPROPERTIES ("skip.header.line.count"="1", "skip.footer.line.count"="3");

Note — I strongly suggest NOT using this property in production. Headers/footers should be removed as part of an ETL process before loading data into the Hive database.
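That pre-processing can be as simple as dropping the first and last lines before the file lands in HDFS. A minimal sketch (the function name and counts here are just for illustration):

```python
def strip_header_footer(lines, header_count=1, footer_count=0):
    """Drop the first header_count and the last footer_count lines."""
    end = len(lines) - footer_count if footer_count else len(lines)
    return lines[header_count:end]

rows = ["EID NAME ...", "1001Alice...", "1002Bob  ...", "TRAILER 2"]
print(strip_header_footer(rows, header_count=1, footer_count=1))
# ['1001Alice...', '1002Bob  ...']
```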

Hope you like the post. You can post your comments/suggestions below.

21 thoughts on “Fixed width files in Hive”


  1. Hi Venkat,
    Firstly, this was very useful, thanks.
    I have a data row length of 50.
    My COL 1 is between positions 3 and 10,
    COL 2 is between 7 and 20, etc.
    Now how do I write my regex to extract values between those specific column positions, ignoring the unwanted middle columns?
    I am using something like this:
    "input.regex" = ".{3}(.{10}).{7}(.{20})'.*",
    I am getting nulls.
    Appreciate any help on this.
    Thanks

    1. Hi. Thank you for your kind words. Since your requirement is selecting arbitrary fields from a fixed-width file, I would suggest loading the file into a single-column (temp) table and using Hive substr to extract the required fields.
      With "input.regex" = "(.{3})(.{10})(.{7})(.{20}).*" you are telling the SerDe to capture the first 3 chars, then the next 10 chars, then 7, then 20, and so on; each captured group maps to a column, so you cannot skip the unwanted data this way.
      Hope this helps. Let me know if you have resolved it or found a better approach.
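To make the substr arithmetic concrete: Hive's substr(str, pos, len) is 1-based, so the start positions from the layout table in the post can be used directly. This small Python helper (hypothetical, merely mimicking Hive's semantics) verifies the positions against the sample row:

```python
def hive_substr(s, start, length):
    """Mimic Hive's 1-based substr(str, pos, len), to sanity-check positions."""
    return s[start - 1:start - 1 + length]

# Sample row from the article, built field by field (widths 4, 35, 3, 11, 4)
row = ("1001"
       + "Subbayya Sivasankaranarayana Pillai"
       + " 25" + "   25000.00" + "HR  ")

print(hive_substr(row, 1, 4))    # EID    -> '1001'
print(hive_substr(row, 40, 3))   # AGE    -> ' 25'
print(hive_substr(row, 43, 11))  # SALARY -> '   25000.00'
```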

  2. Hi Venkat,
    I want to create an external table on a fixed-width file. As you guided above, I am able to create the table, but when I select from it the column values show as NULL.

    Sample data:
    123raju hyd 123
    145satish bng 5679

    Queries:
    CREATE EXTERNAL TABLE IF NOT EXISTS MY_TABLE(emp_no string,name_ string,loc string,salary string)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
    WITH SERDEPROPERTIES ('input.format.string' = 'FL3#FL#10#5FL#4')
    LOCATION 'hdfs://bigdatalite.localdomain:8020/user/hdfs/Ganga';

    CREATE EXTERNAL TABLE IF NOT EXISTS MY_TABLE(emp_no string,name_ string,loc string,salary string)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
    WITH SERDEPROPERTIES ('input.regex' = '(.{3})(.{6})(.{5})(.{4})*')
    LOCATION 'hdfs://bigdatalite.localdomain:8020/user/hdfs/Ganga';

    CREATE EXTERNAL TABLE IF NOT EXISTS MY_TABLE(emp_no string,name_ string,loc string,salary string)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
    WITH SERDEPROPERTIES ("input.regex" = "(.{3})(.{6})(.{5})(.{4})","output.format.string" = "%1$s %2$s %3$s %4$s")
    LOCATION 'hdfs://bigdatalite.localdomain:8020/user/hdfs/Ganga';

    1. Hi Gangadhar,
      A fixed-width file is self-explanatory: data in all rows must be of fixed length, i.e., if you say 600 chars is the fixed width, then the file must contain 600 chars in every row of data. You should prefix/suffix spaces or some other characters to fill out the length of 600. If this is not fulfilled, there is no point calling the input file a fixed-width file (note that the two rows in your sample data have different lengths).
      Hope you can see where the issue in the file is.
      –Venkat
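The padding described in the reply, making every row exactly the declared width, can be applied before load. A rough sketch, assuming the 57-character layout from the sample file (the function name is made up):

```python
def pad_row(row, width=57, fill=" "):
    """Right-pad short rows so that every line is exactly `width` characters."""
    if len(row) > width:
        raise ValueError(f"row longer than fixed width {width}: {row!r}")
    return row.ljust(width, fill)

padded = pad_row("1006Madame Curie")
print(len(padded))  # 57
```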

  3. +---------------------------------------+
     |        |        |Start     |End       |
     |Field   |Length  |position  |position  |
     +---------------------------------------+
     |EID     |4       |1         |4         |
     |NAME    |35      |5         |39        |
     |AGE     |3       |40        |42        |
     |SALARY  |11      |43        |53        |
     |DEPT    |4       |54        |57        |
     +---------------------------------------+

    What is the syntax for this?

  4. Hey Venkat,
    Even I am facing a similar issue of getting NULL values. I have created my table and placed the txt file in the table path, but running a select always gives me NULLs. Please suggest.

    1. Hi San – Can you share some sample data (a couple of lines of masked data) so that I can replicate the issue and find a solution?

  5. Hi Venkat,
    I encounter a problem when using the regex to load data into Hive:
    the fixed-length file contains multibyte characters, and the length is fixed at the byte level.
    For example, two rows of the file would be (both consist of 9 bytes):
    你好123
    ABCDEFGHI

    How can we load this kind of file?
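The mismatch here is between character counts and byte counts: the SerDe regex counts characters, while the file is fixed at the byte level. In UTF-8 the two rows above have the same byte length but different character lengths, which is easy to see:

```python
for line in ["你好123", "ABCDEFGHI"]:
    print(repr(line), len(line), "chars,", len(line.encode("utf-8")), "bytes")
# 你好123    -> 5 chars, 9 bytes (each CJK character is 3 bytes in UTF-8)
# ABCDEFGHI -> 9 chars, 9 bytes
```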

  6. Venkat,
    I am entering the query in Hue and the Hue interface is asking for properties to run the SQL.
    I am using Cloudera. Do you think you can help me out?

  7. Hi Venkat,
    Thank you very much for the post.
    I followed your steps exactly and created an external table using the SerDe (org.apache.hadoop.hive.contrib.serde2.RegexSerDe), but somehow all the fields display NULLs in the table. Not sure why; any help is really appreciated.

    Thanks,
    Narendra

    1. Sure. In my example I created the staging table with all fields of data type STRING to overcome this issue, because fixed-length data is always read as strings/characters. It is good practice to create a staging table, load your raw/semi-processed data into it, and then move the data from there into your main/master table. Once the data is loaded into the master table you can delete/archive the staging data.

      Hope this helps.

      –Venkat

      1. Could you please send me the sample data you are working on? Based on the data I will be in a position to explain the issue behind it.

  8. Hi,
    I need your help with one of my use cases: the above SerDe is not working for me. Please let me know if you can help.
    I will post further details as required.

  9. Regarding the header line: with Hive 0.13 and above you can add TBLPROPERTIES ("skip.header.line.count"="1") to the DDL.
