In my previous post, “Fixed width files in Pig – Part 1”, we saw how to read fixed-width files and load them into HDFS as a tab-separated dataset using a static Pig script. Today we will discuss how to make the Pig script dynamic, so that we do NOT have to edit it every time the field list of the input file changes.
Sample fixed length file (sample_file):
EID  NAME                                    AGEGSALARY     DEPT
1001 Subbayya Sivasankaranarayana Pillai      25M  425000.00HR
1002 Raj Chandra Bose                         27M  310000.00FIN
1003 Tirukkannapuram Vijayaraghavan           30M  544000.00MKT
1004 Dattaraya Ramchandra Kaprekar            21M  682345.00EDU
1005 Samarendra Nath Roy                      24M  823456.00ADM
1006 Madame Curie                             26F  723456.00SCI
1007 Rosalind Franklin                        23F  321456.00SCI
Here we have fields with the following length, start position and end position.
+--------+--------+----------+----------+
|        |        | Start    | End      |
| Field  | Length | position | position |
+--------+--------+----------+----------+
| EID    | 5      | 1        | 5        |
| NAME   | 40     | 6        | 45       |
| AGE    | 3      | 46       | 48       |
| GENDER | 1      | 49       | 49       |
| SALARY | 11     | 50       | 60       |
| DEPT   | 4      | 61       | 64       |
+--------+--------+----------+----------+
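To make the position arithmetic concrete, here is a minimal Python sketch (not part of the Pig workflow) that slices one record using the start and end positions from the table. The helper name `split_fixed_width` is hypothetical, and the record is rebuilt with padding on the assumption that AGE and SALARY are right-aligned within their columns.

```python
# Field layout from the table: (name, start, end), 1-based and inclusive.
FIELDS = [
    ("EID", 1, 5),
    ("NAME", 6, 45),
    ("AGE", 46, 48),
    ("GENDER", 49, 49),
    ("SALARY", 50, 60),
    ("DEPT", 61, 64),
]


def split_fixed_width(line, fields=FIELDS):
    """Slice a fixed-width line into a dict of trimmed field values."""
    # A 1-based inclusive range [start, end] is the Python slice [start-1:end].
    return {name: line[start - 1:end].strip() for name, start, end in fields}


# Rebuild the first sample record with explicit padding.
# Assumption: AGE and SALARY are right-aligned, the other fields left-aligned.
record = (
    "1001".ljust(5)
    + "Subbayya Sivasankaranarayana Pillai".ljust(40)
    + "25".rjust(3)
    + "M"
    + "425000.00".rjust(11)
    + "HR".ljust(4)
)

parsed = split_fixed_width(record)
print(len(record))  # total record width
print(parsed)
```

The field lengths add up to 64 characters, which matches the end position of the last field.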
Process 2: Dynamic script - passing parameters to the Pig script dynamically
To make the script dynamic, we supply the values that may change (input path, output path, delimiter, position parameters, header option, field list, and so on) in a parameter file.
1. Create a parameter file “dynmParam.param” and add the variables to it as below:
jar_file = '/path/to/piggybank.jar'
input_dir = '/path/to/input/sample_file'
pos_param = '-5, 6-45, 46-48, 49-49, 50-60, 61-'
header_option = 'SKIP_HEADER'
field_list = 'EMPLID: CHARARRAY, NAME: CHARARRAY, AGE: INT, GENDER: CHARARRAY, SALARY: DOUBLE, DEPT: CHARARRAY'
output_dir = '/path/to/output/'
output_delim = '\\t'
multi_line = 'YES_MULTILINE'
os_option = 'UNIX'
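The `pos_param` value follows directly from the field lengths, so it does not have to be worked out by hand when the layout changes. A small Python sketch, assuming the lengths from the table above and a hypothetical helper named `build_pos_param`, could derive it like this:

```python
def build_pos_param(lengths):
    """Turn a list of field lengths into a column-range string.

    The first range is left open at the start and the last range is left
    open at the end, mirroring the pos_param value in the parameter file.
    Assumes at least two fields.
    """
    ranges = []
    start = 1
    for length in lengths:
        end = start + length - 1  # inclusive end position
        ranges.append((start, end))
        start = end + 1

    parts = [f"{s}-{e}" for s, e in ranges]
    parts[0] = f"-{ranges[0][1]}"    # e.g. "-5"
    parts[-1] = f"{ranges[-1][0]}-"  # e.g. "61-"
    return ", ".join(parts)


# Lengths of EID, NAME, AGE, GENDER, SALARY, DEPT.
FIELD_LENGTHS = [5, 40, 3, 1, 11, 4]
pos_param = build_pos_param(FIELD_LENGTHS)
print(pos_param)  # -5, 6-45, 46-48, 49-49, 50-60, 61-
```

The printed string can be pasted into the parameter file as the `pos_param` value.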
2. Take the Pig script from my previous post “Fixed width files in Pig – Part 1” and replace every hardcoded value with a variable, as below:
REGISTER '$jar_file';

data = load '$input_dir' using org.apache.pig.piggybank.storage.FixedWidthLoader(
    '$pos_param', '$header_option', '$field_list'
);

store data into '$output_dir' using org.apache.pig.piggybank.storage.CSVExcelStorage('$output_delim', '$multi_line', '$os_option');
3. Run the Pig script “readFixedFile.pig” from the console with the command below:
$ pig -m /path/to/dynmParam.param -f /path/to/readFixedFile.pig
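Before submitting the job, it can help to preview what the script looks like once the parameters are filled in. The Python sketch below is only a rough imitation of Pig's parameter substitution, not Pig's actual preprocessor: it assumes one `name = 'value'` pair per line, treats lines starting with `#` as comments, and uses hypothetical helper names (`parse_param_file`, `preview_script`).

```python
from string import Template

# A subset of dynmParam.param, kept as a raw string for illustration.
PARAM_TEXT = r"""
# parameters for readFixedFile.pig
jar_file = '/path/to/piggybank.jar'
input_dir = '/path/to/input/sample_file'
pos_param = '-5, 6-45, 46-48, 49-49, 50-60, 61-'
header_option = 'SKIP_HEADER'
field_list = 'EMPLID: CHARARRAY, NAME: CHARARRAY'
"""

SCRIPT_TEXT = """REGISTER '$jar_file';
data = load '$input_dir' using org.apache.pig.piggybank.storage.FixedWidthLoader(
    '$pos_param', '$header_option', '$field_list'
);
"""


def parse_param_file(text):
    """Read name = 'value' lines into a dict, dropping the outer quotes."""
    params = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, _, value = line.partition("=")
        value = value.strip()
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            value = value[1:-1]
        params[name.strip()] = value
    return params


def preview_script(script, params):
    """Replace every $name in the script; raises KeyError if one is missing."""
    return Template(script).substitute(params)


params = parse_param_file(PARAM_TEXT)
preview = preview_script(SCRIPT_TEXT, params)
print(preview)
```

A missing parameter shows up here as a `KeyError`, which is a cheap way to catch a variable that was used in the script but never defined in the parameter file. Pig itself also offers a `-dryrun` option that writes out the substituted script without executing it.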
Output file “part-m-00000”:
1001	Subbayya Sivasankaranarayana Pillai	25	M	425000.0	HR
1002	Raj Chandra Bose	27	M	310000.0	FIN
1003	Tirukkannapuram Vijayaraghavan	30	M	544000.0	MKT
1004	Dattaraya Ramchandra Kaprekar	21	M	682345.0	EDU
1005	Samarendra Nath Roy	24	M	823456.0	ADM
1006	Madame Curie	26	F	723456.0	SCI
1007	Rosalind Franklin	23	F	321456.0	SCI
Note: If you use relative paths, make sure they are valid from whichever location the Pig script is executed.
For more on this please follow the blog at Mortar Docs.
Hope you like it. You can post your comments or suggestions below.