Split file into multiple files using Pig Script

Sometimes you need to split a file into separate files based on a key value. You can do this in Java, C, C++ or any other programming language with a few dozen lines of code, which is fine if the file is 1 GB or smaller. But what if the file is larger than 1 GB? A single-process program can then take a very long time to finish the job.

On a Hadoop system, Apache Pig lets you write very simple code that splits the file on the fly. You keep the flexibility to control the flow of data, apply any manipulations you need, and then split the file.

Let's see how to split a file into individual files using a Pig script. Here is our sample file, TEST.DEV.ENV.SAMPLE.FILE:

000001010K1DIBB  7RHUN  2100000AE            J82V  2269167AD         2002-03-079999-12-31+000000000100000    22004-02-28-
000002010K1DIBB  7RHUN  2100000AE            J82V  2269167AD         2002-03-072004-07-30+000000000100000    32004-02-28-
0000030108VV9IH  AKB7L  3300000AE            XMV9  2269167AD         2002-03-059999-12-31+000000000100000    22004-02-28-
0000040108VV9IH  AKB7L  3300000SE            XMV9  2269167AD         2002-03-052004-07-30+000000000100000    32004-02-28-
000005010ULY674  5XWJR  0100000SE            XMV9  2269167AD         2002-03-059999-12-31+000000000100000    22004-02-28-
000006010ULY674  5XWJR  0100000AE            XMV9  2269167AD         2002-03-052004-07-30+000000000100000    32004-02-28-
000007010QT0X36  RJPWK  5500000AE            J82V  2269167AD         2002-10-229999-12-31+000000000100000    22004-02-28-
000008010QT0X36  RJPWK  5500000AE            J82V  2269167AD         2002-10-222004-07-30+000000000100000    32004-02-28-
000009010S8LIKA  07L1X  4400000BE            J82V  2269167AD         2002-03-079999-12-31+000000000100000    22004-02-28-
000010010S8LIKA  07L1X  4400000BE            J82V  2269167AD         2002-03-072004-07-30+000000000100000    32004-02-28-
000011010QAS7G3  CO46Q  8500000BE            RI12  2269167AD         2002-03-059999-12-31+000000000100000    22004-02-28-
000012010QAS7G3  CO46Q  8500000BE            RI12  2269167AD         2002-03-052004-07-30+000000000100000    32004-02-28-

If you look at the sample file, there are many headers and trailers with some data between them. Our requirement is to split each set of HEADER, DETAIL DATA and TRAILER rows into its own file. For our sample, this should generate 6 different files.

We will split the file using key values contained in the rows. Here we use positions 11-15 (5 characters) in the DETAIL DATA rows and positions 8-12 in the HEADER and TRAILER rows. We must keep the rows consistent as header row, detail rows and trailer row, so that each output file has the complete structure and all related data stays together.
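The key positions above can be previewed with a short sketch like the one below. It assumes the same placeholder input path used in this post, and the alias names RAW and KEYS are made up for illustration. Pig's SUBSTRING takes a zero-based start index and an exclusive end index, so positions 11-15 become (10, 15) and positions 8-12 become (7, 12).

```pig
-- Illustrative sketch only: preview the grouping key of each row.
-- The path is the placeholder used in this post; RAW and KEYS are hypothetical aliases.
RAW  = LOAD '/path/to/input/file/TEST.DEV.ENV.SAMPLE.FILE'
       USING PigStorage('\t') AS (line:chararray);

-- 1-based positions 11-15 -> zero-based start 10, exclusive end 15
KEYS = FOREACH RAW GENERATE SUBSTRING(line, 10, 15) AS key, line;

-- Print (key, line) pairs to the console for a visual check
DUMP KEYS;
```

For the first sample row, which starts with 000001010K1DIBB, the extracted key is 1DIBB.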

Now we write the Pig script to split the file:

REGISTER /home/jars/pig/piggybank.jar;

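-- Load each record as a single chararray field
A  = LOAD  '/path/to/input/file/TEST.DEV.ENV.SAMPLE.FILE'
     USING PigStorage('\t') AS (line:chararray);

-- Detail rows: everything that is neither a header nor a trailer
B  = FILTER A BY SUBSTRING(line, 1, 7) != 'HEADER' AND SUBSTRING(line, 0, 7) != '9TRAILR';

-- Header and trailer rows
C  = FILTER A BY SUBSTRING(line, 1, 7) == 'HEADER' OR SUBSTRING(line, 0, 7) == '9TRAILR';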

-- Extract data based on key value from Header, Details and Trailer rows 
D  = GROUP B BY SUBSTRING($0, 10, 15);
E  = GROUP C BY SUBSTRING($0, 7, 12);

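F  = UNION D, E;

-- Assumed step (G is not defined elsewhere in this listing):
-- flatten each group back into (key, line) rows
G  = FOREACH F GENERATE $0, FLATTEN($1);

-- Keep only the rows that have a non-empty key
SPLIT G INTO H IF SIZE($0) > 0, X IF SIZE($0) <= 0;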

J  = ORDER H BY $1;

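STORE J INTO '/path/to/output/directory'
        -- MultiStorage (from piggybank) writes one output subdirectory per distinct
        -- value of field 0, the key; fields are separated by \t by default
        USING org.apache.pig.piggybank.storage.MultiStorage('/path/to/output/directory', '0');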

The file is split on the value we chose as the key. Here the key for all rows (header, detail and trailer) is 5 characters long, extracted with the SUBSTRING() function in transformations ‘D’ and ‘E’.
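Because each distinct key ends up as its own output, it can help to count the distinct keys before running the split. The following is a rough sketch under the same assumptions as the script (placeholder input path, detail key at positions 11-15); the aliases RAW, DETAIL, KEYS, UNIQ, GRP and CNT are hypothetical names chosen for this example.

```pig
-- Illustrative sketch only: count the distinct detail keys in the input.
RAW    = LOAD '/path/to/input/file/TEST.DEV.ENV.SAMPLE.FILE'
         USING PigStorage('\t') AS (line:chararray);

-- Keep detail rows only, using the same test as relation B in the script
DETAIL = FILTER RAW BY SUBSTRING(line, 1, 7) != 'HEADER' AND SUBSTRING(line, 0, 7) != '9TRAILR';

-- Extract the 5-character key and reduce to unique values
KEYS   = FOREACH DETAIL GENERATE SUBSTRING(line, 10, 15) AS key;
UNIQ   = DISTINCT KEYS;

-- Collapse everything into one group so COUNT sees all unique keys
GRP    = GROUP UNIQ ALL;
CNT    = FOREACH GRP GENERATE COUNT(UNIQ) AS distinct_keys;

DUMP CNT;
```

For the twelve sample rows shown earlier, this reports six distinct keys, which matches the six files expected.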

Once the script finishes, check the output directory for the split files.
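One way to inspect the result without leaving Pig is the fs command, which passes its arguments to the Hadoop file system shell. This is a minimal sketch that assumes the placeholder output path from the script; the exact directory and file names you see depend on the storage function used.

```pig
-- Illustrative sketch only: list the output directory from a Pig script or the Grunt shell.
-- The path is the placeholder used in this post.
fs -ls /path/to/output/directory
```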

I have given you the idea; the rest is up to your imagination.