Sometimes you need to split a file into separate files based on a key value. You can do this in Java, C, C++ or any other programming language with a few dozen lines of code, which is fine if the file is 1 GB or smaller. But what if the file is larger than 1 GB? A single-process program can take a very long time to finish the job.
On a Hadoop system, Apache Pig lets you write a very simple script that splits the file on the fly. You also keep the flexibility to control the flow of data and apply any manipulations before splitting the file.
Let's see how to split a file into individual files using a Pig script. Here is our sample file, TEST.DEV.ENV.SAMPLE.FILE:
HEADER1DIBBCCLY8-9568347556434756972CMMS21WUE
000001010K1DIBB 7RHUN 2100000AE J82V 2269167AD 2002-03-079999-12-31+000000000100000 22004-02-28-20.00.13.106749
000002010K1DIBB 7RHUN 2100000AE J82V 2269167AD 2002-03-072004-07-30+000000000100000 32004-02-28-20.00.13.106749
9TRAILR1DIBBCCLY8-95683475564347560000084
HEADERVV9IHFYSKN-4654178251104433898CMMS21ANI
0000030108VV9IH AKB7L 3300000AE XMV9 2269167AD 2002-03-059999-12-31+000000000100000 22004-02-28-20.00.13.106749
0000040108VV9IH AKB7L 3300000SE XMV9 2269167AD 2002-03-052004-07-30+000000000100000 32004-02-28-20.00.13.106749
9TRAILRVV9IHFYSKN-46541782511044330000510
HEADERLY674FBNR4-9375012333185998800CMMS21AIZ
000005010ULY674 5XWJR 0100000SE XMV9 2269167AD 2002-03-059999-12-31+000000000100000 22004-02-28-20.00.13.106749
000006010ULY674 5XWJR 0100000AE XMV9 2269167AD 2002-03-052004-07-30+000000000100000 32004-02-28-20.00.13.106749
9TRAILRLY674FBNR4-93750123331859980000150
HEADERT0X36Q6YVQ-5632769394873798290CMMS21WLO
000007010QT0X36 RJPWK 5500000AE J82V 2269167AD 2002-10-229999-12-31+000000000100000 22004-02-28-20.00.13.106749
000008010QT0X36 RJPWK 5500000AE J82V 2269167AD 2002-10-222004-07-30+000000000100000 32004-02-28-20.00.13.106749
9TRAILRT0X36Q6YVQ-56327693948737980000642
HEADER8LIKAC67U9-2737265552238819829CMMS21HMV
000009010S8LIKA 07L1X 4400000BE J82V 2269167AD 2002-03-079999-12-31+000000000100000 22004-02-28-20.00.13.106749
000010010S8LIKA 07L1X 4400000BE J82V 2269167AD 2002-03-072004-07-30+000000000100000 32004-02-28-20.00.13.106749
9TRAILR8LIKAC67U9-27372655522388190000412
HEADERAS7G3QPIUC-8825934656338659366CMMS21BQA
000011010QAS7G3 CO46Q 8500000BE RI12 2269167AD 2002-03-059999-12-31+000000000100000 22004-02-28-20.00.13.106749
000012010QAS7G3 CO46Q 8500000BE RI12 2269167AD 2002-03-052004-07-30+000000000100000 32004-02-28-20.00.13.106749
9TRAILRAS7G3QPIUC-88259346563386590000865
If you look at the sample file, it contains several headers and trailers with detail data between them. Our requirement is to write each set (HEADER row, DETAIL rows and TRAILER row) to its own file. For our sample, this should produce 6 files.
We split the file using a key value that appears in every row. The key sits at positions 11-15 (5 characters) in DETAIL rows and at positions 8-12 in HEADER and TRAILER rows. The data must stay consistent: each output keeps its header row, detail rows and trailer row together, which preserves the complete structure of the file.
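To make the offsets concrete, here is a minimal Python sketch of the key lookup. Python is used only for illustration and is not part of the Pig job. It assumes, as the SUBSTRING(line, 1, 7) check in the Pig script implies, that header rows carry one leading character before HEADER (shown here as a space). The function name record_key and the shortened sample rows are hypothetical.

```python
def record_key(line: str) -> str:
    """Return the 5-character split key for one fixed-width record."""
    if line[1:7] == "HEADER" or line[0:7] == "9TRAILR":
        # Header and trailer rows: positions 8-12 (1-based) -> slice [7:12]
        return line[7:12]
    # Detail rows: positions 11-15 (1-based) -> slice [10:15]
    return line[10:15]


# Shortened sample rows; the leading space on the header row is an assumption.
header = " HEADER1DIBBCCLY8-9568347556434756972CMMS21WUE"
detail = "000001010K1DIBB 7RHUN 2100000AE J82V 2269167AD"
trailer = "9TRAILR1DIBBCCLY8-95683475564347560000084"

print(record_key(header), record_key(detail), record_key(trailer))
# prints: 1DIBB 1DIBB 1DIBB
```

All three row types resolve to the same key, which is what lets the rows of one set be routed to the same output.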
Now we write the Pig script to split the file:
REGISTER /home/jars/pig/piggybank.jar;
-- Load each line; with no tab characters in the data, the whole record lands in the single field 'line'
A = LOAD '/path/to/input/file/TEST.DEV.ENV.SAMPLE.FILE'
    USING PigStorage('\t') AS (line:chararray);
-- B holds the detail rows, C holds the header and trailer rows
B = FILTER A BY SUBSTRING(line, 1, 7) != 'HEADER' AND SUBSTRING(line, 0, 7) != '9TRAILR';
C = FILTER A BY SUBSTRING(line, 1, 7) == 'HEADER' OR SUBSTRING(line, 0, 7) == '9TRAILR';
-- Extract data based on key value from Header, Details and Trailer rows
D = GROUP B BY SUBSTRING($0, 10, 15);
E = GROUP C BY SUBSTRING($0, 7, 12);
F = UNION D, E;
-- Flatten each group into (key, line) rows
G = FOREACH F GENERATE FLATTEN($0), FLATTEN($1);
-- Keep only the rows that have a non-empty key
SPLIT G INTO H IF SIZE($0) > 0, X IF SIZE($0) <= 0;
-- Order the rows by the line text
J = ORDER H BY $1;
-- MultiStorage splits the output on field 0 (the key) and writes fields with \t as the default delimiter
STORE J INTO '/path/to/output/directory'
    USING org.apache.pig.piggybank.storage.MultiStorage('/path/to/output/directory', '0');
The file is split on the value we chose as the key. Here the key from every row (Header, Details and Trailer) is 5 characters long, extracted with the SUBSTRING() function in transformations 'D' and 'E'.
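For a small file, the same group-then-order idea can be checked locally. The sketch below is a hedged Python illustration, not a replacement for the Pig job. It mirrors the key offsets used in 'D' and 'E' and sorts each group by line text. The function name split_by_key, the shortened rows and the leading space on header rows are assumptions made for this example. Under that assumption, sorting by line text places the header first, the numbered detail rows next and the 9TRAILR row last.

```python
from collections import defaultdict


def split_by_key(lines):
    """Group records by their 5-character key and sort each group by line text."""
    groups = defaultdict(list)
    for line in lines:
        is_marker = line[1:7] == "HEADER" or line[0:7] == "9TRAILR"
        # Same offsets as SUBSTRING($0, 7, 12) and SUBSTRING($0, 10, 15)
        key = line[7:12] if is_marker else line[10:15]
        if key:  # same intent as the SIZE($0) > 0 check in the script
            groups[key].append(line)
    # " HEADER" sorts before digits, and "9TRAILR" sorts after any numbered detail row
    return {key: sorted(rows) for key, rows in groups.items()}


# Shortened, shuffled sample rows; the leading space on header rows is an assumption.
rows = [
    "9TRAILR1DIBBCCLY8-95683475564347560000084",
    "000002010K1DIBB 7RHUN 2100000AE",
    " HEADERVV9IHFYSKN-4654178251104433898CMMS21ANI",
    "000001010K1DIBB 7RHUN 2100000AE",
    " HEADER1DIBBCCLY8-9568347556434756972CMMS21WUE",
    "0000030108VV9IH AKB7L 3300000AE",
]

result = split_by_key(rows)
for key, group in result.items():
    print(key, len(group))
```

Each value in the returned dictionary corresponds to the content of one output file, with the header, detail and trailer rows kept together.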
Now check the output directory for the split files. MultiStorage writes the rows for each key value into a subdirectory named after that key.
I have given you the idea; the rest is up to your imagination.