Today we will see how to load fixed length file data into HDFS using Apache Pig.
Sample fixed length file (sample_file):
EID  NAME                                    AGEGSALARY     DEPT
1001 Subbayya Sivasankaranarayana Pillai      25M  425000.00HR
1002 Raj Chandra Bose                         27M  310000.00FIN
1003 Tirukkannapuram Vijayaraghavan           30M  544000.00MKT
1004 Dattaraya Ramchandra Kaprekar            21M  682345.00EDU
1005 Samarendra Nath Roy                      24M  823456.00ADM
1006 Madame Curie                             26F  723456.00SCI
1007 Rosalind Franklin                        23F  321456.00SCI
Here we have fields with the following length, start position and end position.
+---------------------------------------+
|        |        |Start     |End       |
|Field   |Length  |position  |position  |
+---------------------------------------+
|EID     |5       |1         |5         |
|NAME    |40      |6         |45        |
|AGE     |3       |46        |48        |
|GENDER  |1       |49        |49        |
|SALARY  |11      |50        |60        |
|DEPT    |4       |61        |64        |
+---------------------------------------+
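The start and end positions in the table follow directly from the field lengths: each field starts one column after the previous field ends. As a minimal sketch of that arithmetic (written in Python purely for illustration, not part of the Pig script; the helper name positions_from_lengths is hypothetical):

```python
# Field names and lengths as listed in the table above.
FIELDS = [
    ("EID", 5),
    ("NAME", 40),
    ("AGE", 3),
    ("GENDER", 1),
    ("SALARY", 11),
    ("DEPT", 4),
]


def positions_from_lengths(fields):
    """Return (name, start, end) tuples, 1-indexed and inclusive on both ends."""
    result = []
    start = 1
    for name, length in fields:
        end = start + length - 1  # inclusive end position
        result.append((name, start, end))
        start = end + 1  # the next field begins right after this one
    return result


if __name__ == "__main__":
    for name, start, end in positions_from_lengths(FIELDS):
        print(f"{name:<8}{start}-{end}")
```

Running it should reproduce the start and end columns of the table, which is a quick way to double-check a layout before hardcoding positions anywhere.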
We can achieve this in two ways:
1. Static programming: All parameters are hardcoded in the script. Such scripts are rigid, and we have to edit them whenever there is even a small change in the process.
2. Dynamic programming: Parameters are passed to the script at run time, so the script is reusable. Create it once and use it with different parameters. You don’t have to change the script unless there is a major change in the process.
Using the classes “FixedWidthLoader” and “CSVExcelStorage” from the jar file “piggybank.jar”, we can read a fixed-width file and convert it to any delimited format.
Process 1: Static programming ‒ Hardcoded parameters in Pig script
REGISTER '/path/to/piggybank.jar';

data = load '/path/to/input/sample_file'
    using org.apache.pig.piggybank.storage.FixedWidthLoader(
        '-5, 6-45, 46-48, 49-49, 50-60, 61-', -- Hardcoding of position parameters
        'SKIP_HEADER',
        'EMPLID: CHARARRAY, NAME: CHARARRAY, AGE: INT, GENDER: CHARARRAY, SALARY: DOUBLE, DEPT: CHARARRAY'
    );

store data into '/path/to/output/tab_saparated'
    using org.apache.pig.piggybank.storage.CSVExcelStorage('\t', 'YES_MULTILINE', 'UNIX');
Run the Pig script “readFixedFile.pig” from the console using the command below:
$ pig -f /path/to/readFixedFile.pig
Output file “part-m-00000”
1001	Subbayya Sivasankaranarayana Pillai	25	M	425000.0	HR
1002	Raj Chandra Bose	27	M	310000.0	FIN
1003	Tirukkannapuram Vijayaraghavan	30	M	544000.0	MKT
1004	Dattaraya Ramchandra Kaprekar	21	M	682345.0	EDU
1005	Samarendra Nath Roy	24	M	823456.0	ADM
1006	Madame Curie	26	F	723456.0	SCI
1007	Rosalind Franklin	23	F	321456.0	SCI
Step 1 – Register the jar file piggybank.jar
Step 2 – Use FixedWidthLoader() from it in the Load command
Step 3 – Specify each field start position and end position in the function FixedWidthLoader()
Step 4 – Provide field positions: ‘-5, 6-45, 46-48, 49-49, 50-60, 61-’
The first parameter is mandatory and specifies the positions of the columns. Positions are 1-indexed and inclusive on both ends. “-5” means columns 1 through 5, and “61-” means column 61 to the end of the line. A single-character column at position n can be specified as either n-n or simply n.
Step 5 – To skip header if file has header row (OPTIONAL): ‘SKIP_HEADER’
This parameter is optional and specifies what to do with header rows (a first row containing the titles of each column). If the parameter is set to ‘SKIP_HEADER’, FixedWidthLoader will skip the header row of each input file. The default behavior is to not skip the header; if you need to explicitly state this, set the parameter to ‘USE_HEADER’.
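To make the column specification and the header option described in the steps above more concrete, here is a rough Python sketch of the same semantics. It is an illustration only, not FixedWidthLoader's actual implementation: the function names parse_column_spec and read_fixed_width are hypothetical, and stripping the padding from each value is an assumption made to match the trimmed values in the output file shown earlier.

```python
def parse_column_spec(spec):
    """Turn a spec such as '-5, 6-45, 61-' into Python slice objects.

    Spec positions are 1-indexed and inclusive on both ends, so the
    range a-b corresponds to the Python slice [a-1:b].
    """
    slices = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start_text, end_text = part.split("-", 1)
        else:
            start_text = end_text = part  # 'n' is shorthand for 'n-n'
        start = int(start_text) if start_text else 1  # '-5' starts at column 1
        end = int(end_text) if end_text else None     # '61-' runs to end of line
        slices.append(slice(start - 1, end))
    return slices


def read_fixed_width(lines, spec, skip_header=False):
    """Split each line into fields; optionally drop the first (header) line."""
    slices = parse_column_spec(spec)
    rows = []
    for index, line in enumerate(lines):
        if skip_header and index == 0:
            continue
        # Assumption: padding is stripped from every field value.
        rows.append([line[s].strip() for s in slices])
    return rows


def make_record(eid, name, age, gender, salary, dept):
    """Build one line in the sample layout (5, 40, 3, 1, 11, 4 columns)."""
    return (eid.ljust(5) + name.ljust(40) + age.rjust(3)
            + gender + salary.rjust(11) + dept)


SPEC = "-5, 6-45, 46-48, 49-49, 50-60, 61-"
HEADER = "EID".ljust(5) + "NAME".ljust(40) + "AGE" + "G" + "SALARY".ljust(11) + "DEPT"
LINES = [
    HEADER,
    make_record("1001", "Subbayya Sivasankaranarayana Pillai", "25", "M", "425000.00", "HR"),
    make_record("1006", "Madame Curie", "26", "F", "723456.00", "SCI"),
]

if __name__ == "__main__":
    for row in read_fixed_width(LINES, SPEC, skip_header=True):
        print("\t".join(row))
```

Note that this sketch leaves every field as a string. In the Pig script it is the schema passed as the third parameter that casts AGE to INT and SALARY to DOUBLE, which is why 425000.00 appears as 425000.0 in the output file.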
Process 2: Dynamic programming ‒ Dynamically passing parameters to Pig script.
See Fixed width files in Pig – Part 2 for Dynamic pig latin script.
Note: Make sure every relative path used in the script resolves correctly from the location where the Pig script is executed.
For more on this please follow the blog at Mortar Docs.
Hope you like it. You can post your comments or suggestions below.
Excellent information!!!! But I have a doubt:
How to save data processed on HDFS on folder by date, like:
The process was executed on 2017-02-03 and the data processed will be on path/to/output/2017-02-03..
Tomorrow I will process new data, and save on:
path/to/output/2017-02-04…
Thanks in advance!
Hi,
You can follow my blog on similar requirement of yours
http://www.hadooptechs.com/pig/split-single-file-into-multiple-files-using-pig-script
Regards..
Hello,
I am new to Apache Pig.
Is there a possibility to use different fixed-width for each line?
For Example
First Line use 1-5, 9-10, … (AS First Name; Last Name…)
Second Line use 1-5, 6-16 … (AS Amount;IBAN…) and so on.
Also is there a possibility to add an IF-Statement before each Map like:
If(1-5) = string then use first line map
If(1-5) = Int then use second line map
I am grateful for any help you can give me. I am trying to realise a solution in Hortonworks (Apache Pig) and I am not a professional coder.
I am willing to reward you for any help, that leads to a functional version.
Best Regards
Constantin
Hi. I never got this kind of requirement. Good one. Will try it out. Thanks for writing.