Reading JSON Files in Pig

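Today we will see how to read schemaless JSON files in Pig. To read JSON files we will be working with the following JAR files: json-simple-1.1.1.jar, elephant-bird-hadoop-compat-4.3.jar, elephant-bird-pig-4.3.jar, and findString.jar.
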
Best sample JSON file for testing this is to download tweets from
findString.jar contains a custom UDF written in Java. This UDF is similar to the instr() function in Hive.

REGISTER json-simple-1.1.1.jar;
REGISTER elephant-bird-hadoop-compat-4.3.jar;
REGISTER elephant-bird-pig-4.3.jar;
REGISTER findString.jar;

DEFINE InString findString.findString();

-- '-nestedLoad' parses nested JSON objects (such as 'user') into nested maps
A = LOAD '/tmp/jsonInput/tweets.json'
    USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') As jsonMap;

B = FOREACH A GENERATE jsonMap#'id'           As tweetId,
                       jsonMap#'timestamp_ms' As tweetDateTime,
                       jsonMap#'text'         As tweetComments,
                       jsonMap#'lang'         As tweetLanguage,
                       jsonMap#'source'       As tweetSource,
                       jsonMap#'user'         As tweetUser;

C = FILTER B By ( (chararray)tweetLanguage == 'en' );

D = ORDER C By tweetId, tweetDateTime DESC;

E = FOREACH D GENERATE (long)tweetId,
                       (chararray)tweetUser#'screen_name' As screenName,
                       -- epoch milliseconds to an ISO-8601 string, then strip the 'T' and 'Z'
                       REPLACE(REPLACE(ToString(ToDate((long)tweetDateTime)), 'T', ' '), 'Z', ''),
                       -- each nested CASE needs its own ELSE branch and closing END
                       (CASE InString((chararray)tweetSource, 'TWITTER FOR ANDROID')
                           WHEN 'True' THEN 'Twitter for Android'
                           ELSE CASE InString((chararray)tweetSource, 'TWITTER FOR IPHONE')
                              WHEN 'True' THEN 'Twitter for iPhone'
                              ELSE CASE InString((chararray)tweetSource, 'TWITTER FOR WINDOWS')
                                 WHEN 'True' THEN 'Twitter for Windows'
                                 ELSE 'Browser'
                              END
                           END
                        END),
                       REPLACE((chararray)tweetComments, '\n', ''),
                       (chararray)tweetUser#'friends_count' As friendsCount,
                       (chararray)tweetUser#'followers_count' As followersCount,
                       -- last column ('en' in the output below); inferred from that output
                       (chararray)tweetLanguage;

STORE E INTO '/tmp/jsonOutput/tweetsEng';
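
If the custom UDF is not available, the same projection can probably be written with Pig built-ins only. The following is a sketch of an alternative, not the script that produced the results in this post. It assumes Pig 0.12 or later (for CASE), reuses relation D from the script above, and assumes that InString does a case-insensitive substring check, which is why INDEXOF is combined with UPPER. The names E2, tweetTime and sourceName are made up for illustration.

```pig
-- Alternative sketch using only built-ins (not the original script).
E2 = FOREACH D GENERATE
         (long)tweetId AS tweetId,
         -- format the epoch-millisecond timestamp directly instead of two REPLACE calls
         ToString(ToDate((long)tweetDateTime), 'yyyy-MM-dd HH:mm:ss.SSS') AS tweetTime,
         -- INDEXOF returns -1 when the substring is not found
         (CASE
             WHEN INDEXOF(UPPER((chararray)tweetSource), 'TWITTER FOR ANDROID', 0) >= 0 THEN 'Twitter for Android'
             WHEN INDEXOF(UPPER((chararray)tweetSource), 'TWITTER FOR IPHONE', 0) >= 0 THEN 'Twitter for iPhone'
             WHEN INDEXOF(UPPER((chararray)tweetSource), 'TWITTER FOR WINDOWS', 0) >= 0 THEN 'Twitter for Windows'
             ELSE 'Browser'
          END) AS sourceName;
```

The searched form of CASE (CASE WHEN condition THEN ...) avoids nesting, so there is only one END to keep track of.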

Log stats of the above script.

2015-12-20 17:50:17,082 [main] INFO - Script Statistics: 

HadoopVersion	PigVersion	UserId	StartedAt	FinishedAt	Features
yarn	2015-12-20 17:46:13	2015-12-20 17:50:17	ORDER_BY,FILTER

Job Stats (time in seconds):
JobId                   Maps  Reduces  MaxMapTime  MinMapTime  AvgMapTime  MedianMapTime  MaxReduceTime  MinReduceTime  AvgReduceTime  MedianReducetime  Alias  Feature   Outputs
job_1455173921266_0005  12    0        72          53          65          66             0              0              0              0                 A,B,C  MAP_ONLY
job_1455173921266_0006  2     1        18          15          17          17             8              8              8              8                 E      SAMPLER
job_1455173921266_0007  2     1        10          7           9           9              17             17             17             17                E,F    ORDER_BY  hdfs://,

Successfully read 346871 records (1540293485 bytes) from: "/tmp/jsonInput/tweets.json"

Successfully stored 91168 records (16290719 bytes) in: "hdfs://"

Total records written : 91168
Total bytes written : 16290719
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Output of the above Pig script:

686906039862689792	rafie2012	2015-12-12 13:42:16.657	Browser	Mindfulness My Way: kindle free download this week only: https:\/\/\/W00SlodXo3	1433	429	en
686906048251170816	ADCtrash	2015-12-12 13:42:18.657	Twitter for Android	RT @ravensheda: season 3 is all i can think about pleathe im not ready	79	171	en
686906039875301376	overcome_16	2015-12-12 13:42:16.660	Twitter for Android	Comeback de 9Muses\ud83d\ude0d	500	247	en
686906039883665408	slha28241	2015-12-12 13:42:16.662	Browser	Get Weather Updates from The Weather Channel. 08:42:16	40	4	en
686906039883698176	omega_soft	2015-12-12 13:42:16.662	Browser	Just beautiful https:\/\/\/x4dZkLl9Zg	19	36	en
686906044090421248	daine_mariee	2015-12-12 13:42:17.665	Twitter for iPhone	RT @textposts: Please understand that I'm trying	201	180	en
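
To work with this output in a later script, the tab-separated files can be loaded back with a schema. This is only a sketch: the relation names R and TOP10 and all the column names are assumptions chosen to match the eight columns shown above, and the types are a guess (the counts were cast to chararray in the script, so they are read back as chararray here).

```pig
-- Sketch: read the stored result back with an explicit schema (names are illustrative).
R = LOAD '/tmp/jsonOutput/tweetsEng' USING PigStorage('\t')
        AS (tweetId:long,
            screenName:chararray,
            tweetTime:chararray,
            tweetSource:chararray,
            tweetText:chararray,
            friendsCount:chararray,
            followersCount:chararray,
            tweetLanguage:chararray);

-- Peek at a few rows instead of dumping the whole relation.
TOP10 = LIMIT R 10;
DUMP TOP10;
```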

Hope you like the post. You can post your comments/suggestions below.

    1. Please use “string-1.1.0.jar” instead of “findString.jar”; it seems the jar file has been renamed. This jar is used for string operations on text.
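
    Following that comment, the header of the script would presumably change as sketched below. Only the jar name comes from the comment; that the class inside the renamed jar is still findString.findString is an assumption and should be checked against the jar's contents.

```pig
REGISTER json-simple-1.1.1.jar;
REGISTER elephant-bird-hadoop-compat-4.3.jar;
REGISTER elephant-bird-pig-4.3.jar;
REGISTER string-1.1.0.jar;   -- replaces: REGISTER findString.jar;

-- Assumption: the UDF class name is unchanged in the renamed jar.
DEFINE InString findString.findString();
```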

