Fixed width files in Pig – Part 1

This post shows how to read a fixed-width (fixed-length) file with Apache Pig and store it in HDFS as delimited data.

Sample fixed length file (sample_file):

EID  NAME                                    AGEGSALARY     DEPT
1001 Subbayya Sivasankaranarayana Pillai      25M  425000.00HR
1002 Raj Chandra Bose                         27M  310000.00FIN
1003 Tirukkannapuram Vijayaraghavan           30M  544000.00MKT
1004 Dattaraya Ramchandra Kaprekar            21M  682345.00EDU
1005 Samarendra Nath Roy                      24M  823456.00ADM
1006 Madame Curie                             26F  723456.00SCI
1007 Rosalind Franklin                        23F  321456.00SCI

The file has the following fields, with their lengths and their 1-based start and end positions:

+---------------------------------------+
|        |        |Start     |End       |
|Field   |Length  |position  |position  |
+---------------------------------------+
|EID     |5       |1         |5         |
|NAME    |40      |6         |45        |
|AGE     |3       |46        |48        |
|GENDER  |1       |49        |49        |
|SALARY  |11      |50        |60        |
|DEPT    |4       |61        |64        |
+---------------------------------------+
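
The start and end positions follow mechanically from the field lengths: each field starts one position after the previous one ends. As a minimal shell sketch (the variable names are illustrative, and the lengths are taken from the table above), the ranges can be derived like this:

```shell
# Field lengths in file order: EID, NAME, AGE, GENDER, SALARY, DEPT.
lengths="5 40 3 1 11 4"

start=1
spec=""
for len in $lengths; do
  end=$((start + len - 1))                  # inclusive end position
  spec="${spec:+$spec, }${start}-${end}"    # append "start-end", comma-separated
  start=$((end + 1))                        # next field begins right after
done

echo "$spec"
# prints: 1-5, 6-45, 46-48, 49-49, 50-60, 61-64
```

The Pig script in this post writes the first range as -5 and the last as 61-, which leave the start or the end of the range open.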

There are two ways to do this:
1. Static programming: Parameters are hardcoded in the script. Such scripts are rigid, and the script itself has to be edited whenever the process changes, however slightly.
2. Dynamic programming: Parameters are passed to the script at run time. Such scripts are reusable: write them once and run them with different parameters. The script only has to change when the process changes substantially.

Using the classes “FixedWidthLoader” and “CSVExcelStorage” from the jar file “piggybank.jar”, we can read a fixed-width file and convert it to a dataset with any delimiter.

Process 1: Static programming ‒ Hardcoded parameters in Pig script

REGISTER '/path/to/piggybank.jar';

data = load '/path/to/input/sample_file'
       using org.apache.pig.piggybank.storage.FixedWidthLoader(
       '-5, 6-45, 46-48, 49-49, 50-60, 61-',     --Hardcoding of position parameters
       'SKIP_HEADER',
       'EMPLID: CHARARRAY, NAME: CHARARRAY, AGE: INT, GENDER: CHARARRAY, SALARY: DOUBLE, DEPT: CHARARRAY'
       );

store data into '/path/to/output/tab_separated'
      using org.apache.pig.piggybank.storage.CSVExcelStorage('\t', 'YES_MULTILINE', 'UNIX');

Run the Pig script “readFixedFile.pig” from the console with the following command:

$ pig -f /path/to/readFixedFile.pig

Output file “part-m-00000”

1001	Subbayya Sivasankaranarayana Pillai	25	M	425000.0	HR
1002	Raj Chandra Bose	27	M	310000.0	FIN
1003	Tirukkannapuram Vijayaraghavan	30	M	544000.0	MKT
1004	Dattaraya Ramchandra Kaprekar	21	M	682345.0	EDU
1005	Samarendra Nath Roy	24	M	823456.0	ADM
1006	Madame Curie	26	F	723456.0	SCI
1007	Rosalind Franklin	23	F	321456.0	SCI
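
A quick way to confirm that every output row has the six fields declared in the schema is to count the tab-separated fields. This is a sketch only: it uses two rows copied from the output above as inline data, whereas on a real run the rows would come from the part file.

```shell
# Two rows copied from the output above, written with real tab characters.
out=$(printf '1001\tSubbayya Sivasankaranarayana Pillai\t25\tM\t425000.0\tHR\n1006\tMadame Curie\t26\tF\t723456.0\tSCI\n')

# Count the rows whose field count differs from the six fields in the schema.
bad=$(printf '%s\n' "$out" | awk -F'\t' 'NF != 6 { n++ } END { print n + 0 }')

echo "rows with unexpected field count: $bad"
```

A non-zero count would suggest that a field contains a tab or that the column positions do not match the file.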

The script works as follows:
Step 1 – Register the jar file piggybank.jar.
Step 2 – Use FixedWidthLoader() from it in the LOAD statement.
Step 3 – Specify the start and end position of each field as the first argument of FixedWidthLoader().
Step 4 – Provide the field positions: ‘-5, 6-45, 46-48, 49-49, 50-60, 61-’
The first parameter is mandatory and specifies the positions of the columns. Positions are 1-indexed and inclusive on both ends. “-5” means columns 1 through 5, and “61-” means column 61 to the end of the line. A single-character column at position n can be written as either n-n or simply n.
Step 5 – Skip the header row, if the file has one (OPTIONAL): ‘SKIP_HEADER’
This parameter is optional and specifies what to do with a header row (a first row containing the titles of the columns). If it is set to ‘SKIP_HEADER’, FixedWidthLoader skips the header row of each input file. The default behavior is not to skip the header; to state this explicitly, set the parameter to ‘USE_HEADER’.
Step 6 – Provide the schema as the third argument: the field names and types, in the same order as the positions.
Step 7 – Store the result with CSVExcelStorage(), here with a tab (‘\t’) as the delimiter.
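
A column specification can be sanity-checked outside Pig before running the job. The sketch below is only an aid, not part of the Pig script: it applies the standard cut utility, whose -c option takes character ranges in the same 1-based, inclusive style (including the open-ended -5 and 61- forms), to one row copied from the sample file. Note that cut keeps the padding spaces, whereas the Pig output shown earlier has them trimmed.

```shell
# One data row copied from sample_file (spacing must match the file exactly).
line='1001 Subbayya Sivasankaranarayana Pillai      25M  425000.00HR'

# Same ranges as in the FixedWidthLoader call: -5, 6-45, 46-48, 49, 50-60, 61-
eid=$(printf '%s\n' "$line" | cut -c-5)
name=$(printf '%s\n' "$line" | cut -c6-45)
age=$(printf '%s\n' "$line" | cut -c46-48)
gender=$(printf '%s\n' "$line" | cut -c49)
salary=$(printf '%s\n' "$line" | cut -c50-60)
dept=$(printf '%s\n' "$line" | cut -c61-)

# Brackets make the untrimmed padding visible.
printf '[%s] [%s] [%s] [%s] [%s] [%s]\n' "$eid" "$name" "$age" "$gender" "$salary" "$dept"
```

If a value shows up split across two brackets, the corresponding range in the specification is off by one or more positions.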

Process 2: Dynamic programming ‒ Dynamically passing parameters to Pig script.
See Fixed width files in Pig – Part 2 for Dynamic pig latin script.
 
Note: If the script uses relative paths, make sure they are valid from whichever location the Pig script is executed.
 
For more on this please follow the blog at Mortar Docs.
 
Hope you like it. You can post your comments or suggestions below.

4 thoughts on “Fixed width files in Pig – Part 1”

  1. Excellent information!!!! But I have a doubt:
    How to save data processed on HDFS on folder by date, like:
    The process was executed on 2017-02-03 and the data processed will be on path/to/output/2017-02-03..
    Tomorrow I will process new data, and save on:
    path/to/output/2017-02-04…
    Thanks in advance!

  2. Hello,
    I am new to Apache Pig.
    Is there a possibility to use different fixed-width for each line?
    For Example
    First Line use 1-5, 9-10, … (AS First Name; Last Name…)
    Second Line use 1-5, 6-16 … (AS Amount;IBAN…) and so on.

    Also is there a possibility to add an IF-Statement before each Map like:
    If(1-5) = string then use first line map
    If(1-5) = Int then use second line map

    I am grateful for any help you can give me. I am trying to realise a solution in Hortonworks (Apache Pig) and I am not a professional coder.
    I am willing to reward you for any help, that leads to a functional version.

    Best Regards
    Constantin
