Fixed width files in Pig – Part 2

In my previous post “Fixed width files in Pig – Part 1” we saw how to read a fixed width file and load it into HDFS as a tab-separated dataset using a static Pig script. Today we will discuss how to make the Pig script dynamic so that we DO NOT have to change it every time the field list of the input file changes.

Sample fixed length file (sample_file):

EID  NAME                                    AGEGSALARY     DEPT
1001 Subbayya Sivasankaranarayana Pillai      25M  425000.00HR
1002 Raj Chandra Bose                         27M  310000.00FIN
1003 Tirukkannapuram Vijayaraghavan           30M  544000.00MKT
1004 Dattaraya Ramchandra Kaprekar            21M  682345.00EDU
1005 Samarendra Nath Roy                      24M  823456.00ADM
1006 Madame Curie                             26F  723456.00SCI
1007 Rosalind Franklin                        23F  321456.00SCI

The fields have the following lengths, start positions and end positions.

+---------------------------------------+
|        |        |Start     |End       |
|Field   |Length  |position  |position  |
+---------------------------------------+
|EID     |5       |1         |5         |
|NAME    |40      |6         |45        |
|AGE     |3       |46        |48        |
|GENDER  |1       |49        |49        |
|SALARY  |11      |50        |60        |
|DEPT    |4       |61        |64        |
+---------------------------------------+
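
The start and end positions follow from the field lengths by a running sum. Here is a minimal sketch of that arithmetic, written in Python purely for illustration. The helper names `positions` and `column_spec` are hypothetical and are not part of Pig or piggybank. It also joins the ranges into a comma-separated string of the kind passed to the loader later in this post.

```python
# Field names and lengths copied from the table above.
FIELDS = [
    ("EID", 5),
    ("NAME", 40),
    ("AGE", 3),
    ("GENDER", 1),
    ("SALARY", 11),
    ("DEPT", 4),
]


def positions(fields):
    """Return (name, start, end) tuples with 1-based, inclusive positions."""
    result = []
    start = 1
    for name, length in fields:
        end = start + length - 1
        result.append((name, start, end))
        start = end + 1  # the next field begins right after this one
    return result


def column_spec(fields):
    """Join the ranges into a 'start-end, start-end, ...' string."""
    return ", ".join(f"{start}-{end}" for _, start, end in positions(fields))


spec = column_spec(FIELDS)
print(spec)
```

For the table above this prints `1-5, 6-45, 46-48, 49-49, 50-60, 61-64`. The param file in this post writes the first and last ranges in open-ended form (`-5` and `61-`), which covers the same columns.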

Process 2: Dynamic script ‒ passing parameters to the Pig script at run time
To make the script dynamic, we supply the input path, output path, delimiter, position parameters, header option, field list and so on in a param file.

1. Create a parameter file “dynmParam.param” and add the variables below:

jar_file = '/path/to/piggybank.jar'
input_dir = '/path/to/input/sample_file'
pos_param = '-5, 6-45, 46-48, 49-49, 50-60, 61-'
header_option = 'SKIP_HEADER'
field_list = 'EMPLID: CHARARRAY, NAME: CHARARRAY, AGE: INT, GENDER: CHARARRAY, SALARY: DOUBLE, DEPT: CHARARRAY'
output_dir = '/path/to/output/'
output_delim = '\\t'
multi_line = 'YES_MULTILINE'
os_option = 'UNIX' 

2. Take the Pig script from my previous post “Fixed width files in Pig – Part 1” and replace every hardcoded value with a variable, as below:

REGISTER '$jar_file';

data = load '$input_dir'
       using org.apache.pig.piggybank.storage.FixedWidthLoader(
       '$pos_param',
       '$header_option',
       '$field_list'
       );

store data into '$output_dir'
      using org.apache.pig.piggybank.storage.CSVExcelStorage('$output_delim', '$multi_line', '$os_option'); 
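
Before the script runs, Pig replaces each `$name` with the value from the param file. The sketch below mimics that step in Python with `string.Template`, which happens to use the same `$name` syntax. It is an illustration only, not Pig's preprocessor: `read_params` is a hypothetical helper that handles just simple `name = 'value'` lines, and only the load statement is previewed.

```python
from string import Template

# A few entries copied from dynmParam.param.
PARAM_TEXT = """\
input_dir = '/path/to/input/sample_file'
pos_param = '-5, 6-45, 46-48, 49-49, 50-60, 61-'
header_option = 'SKIP_HEADER'
field_list = 'EMPLID: CHARARRAY, NAME: CHARARRAY, AGE: INT, GENDER: CHARARRAY, SALARY: DOUBLE, DEPT: CHARARRAY'
"""

# The load statement from the script above.
SCRIPT = """\
data = load '$input_dir'
       using org.apache.pig.piggybank.storage.FixedWidthLoader(
       '$pos_param',
       '$header_option',
       '$field_list'
       );
"""


def read_params(text):
    """Parse simple name = 'value' lines; the surrounding quotes are dropped."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        name, _, value = line.partition("=")
        value = value.strip()
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            value = value[1:-1]
        params[name.strip()] = value
    return params


# substitute() raises KeyError when a $name has no value in the param file.
preview = Template(SCRIPT).substitute(read_params(PARAM_TEXT))
print(preview)
```

A preview like this is a quick way to catch a misspelled or missing variable. As far as I know, Pig itself offers a similar check through its `-dryrun` option, which writes out the substituted script without executing it.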

3. Run the Pig script “readFixedFile.pig” from the console with the command below:

$ pig -m /path/to/dynmParam.param -f /path/to/readFixedFile.pig

Output file “part-m-00000”:

1001	Subbayya Sivasankaranarayana Pillai	25	M	425000.0	HR
1002	Raj Chandra Bose	27	M	310000.0	FIN
1003	Tirukkannapuram Vijayaraghavan	30	M	544000.0	MKT
1004	Dattaraya Ramchandra Kaprekar	21	M	682345.0	EDU
1005	Samarendra Nath Roy	24	M	823456.0	ADM
1006	Madame Curie	26	F	723456.0	SCI
1007	Rosalind Franklin	23	F	321456.0	SCI
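
To see how the position parameters map a fixed width line to the tab-separated row above, the slicing can be mimicked outside Pig. This Python sketch is an approximation, not the FixedWidthLoader implementation: `parse_spec` and `split_record` are hypothetical helpers, and the whitespace trimming and type casts are assumptions chosen to match the output shown. The record is rebuilt from padded pieces so that the column widths are exact.

```python
POS_PARAM = "-5, 6-45, 46-48, 49-49, 50-60, 61-"
CASTS = [str, str, int, str, float, str]  # mirrors the types in field_list


def parse_spec(spec):
    """Turn '-5, 6-45, 61-' into 0-based slice bounds; None means 'to the end'."""
    ranges = []
    for part in spec.split(","):
        lo, sep, hi = part.strip().partition("-")
        start = int(lo) - 1 if lo else 0
        if not sep:          # a single column such as '14'
            end = int(lo)
        else:                # 'a-b', '-b' or 'a-'
            end = int(hi) if hi else None
        ranges.append((start, end))
    return ranges


def split_record(line, spec, casts):
    """Slice one line by the spec, trim padding and cast each field."""
    fields = [line[start:end].strip() for start, end in parse_spec(spec)]
    return [cast(value) for cast, value in zip(casts, fields)]


# One record from sample_file, padded to the widths 5, 40, 3, 1, 11 and 4.
record = f"{'1006':<5}{'Madame Curie':<40}{'26':>3}F{'723456.00':>11}{'SCI':<4}"

row = split_record(record, POS_PARAM, CASTS)
print("\t".join(str(value) for value in row))
```

The printed row matches the `1006` line in the output file, including the salary shown as `723456.0` after the cast to a double.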

Note: If you use relative paths, make sure they are valid from the location where the Pig script is executed.

For more on this, please refer to the Mortar Docs.

Hope you like it. You can post your comments or suggestions below.
