Today we will see how to read schema-less JSON files in Pig. To read JSON files we will be working with the following jar files:
json-simple-1.1.1.jar
elephant-bird-hadoop-compat-4.3.jar
elephant-bird-pig-4.3.jar
findString.jar
A good sample JSON file for testing this is a set of tweets downloaded from twitter.com.
findString.jar is a custom UDF written in Java. It is similar to the instr() function in Hive: it checks whether one string occurs inside another.
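The source of findString.jar is not included in this post, so the following is only a sketch of what such a UDF might do. It assumes the check is case-insensitive and that the result is the string 'True' or 'False', since the script below passes upper-case search strings and compares the result with 'True'. The class and method names are hypothetical, and the Pig wiring (extending org.apache.pig.EvalFunc<String> and reading both arguments from the input Tuple) is left out so that the example compiles without Pig on the classpath.

```java
import java.util.Locale;

// Hypothetical stand-in for the logic inside findString.jar; not the actual UDF source.
public class FindStringSketch {

    // Returns "True" when `search` occurs in `text`, ignoring case; otherwise "False".
    // Null inputs yield "False" (an assumption; the real UDF may behave differently).
    public static String inString(String text, String search) {
        if (text == null || search == null) {
            return "False";
        }
        String haystack = text.toUpperCase(Locale.ROOT);
        String needle = search.toUpperCase(Locale.ROOT);
        return haystack.contains(needle) ? "True" : "False";
    }

    public static void main(String[] args) {
        // Illustrative value shaped like a tweet's 'source' field.
        String source = "<a href=\"http://twitter.com/download/android\">Twitter for Android</a>";
        System.out.println(inString(source, "TWITTER FOR ANDROID"));
        System.out.println(inString(source, "TWITTER FOR IPHONE"));
    }
}
```

Returning a string rather than a boolean mirrors how the script consumes the result in its CASE ... WHEN 'True' expressions.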
-- Register the required jars and define the custom UDF.
REGISTER json-simple-1.1.1.jar;
REGISTER elephant-bird-hadoop-compat-4.3.jar;
REGISTER elephant-bird-pig-4.3.jar;
REGISTER findString.jar;
DEFINE InString findString.findString();

-- Load each JSON record as a map; '-nestedLoad' makes nested objects such as 'user' available as nested maps.
A = LOAD '/tmp/jsonInput/tweets.json'
    USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS jsonMap;

-- Pick the fields of interest out of the map.
B = FOREACH A GENERATE
    jsonMap#'id' AS tweetId,
    jsonMap#'timestamp_ms' AS tweetDateTime,
    jsonMap#'text' AS tweetComments,
    jsonMap#'lang' AS tweetLanguage,
    jsonMap#'source' AS tweetSource,
    jsonMap#'user' AS tweetUser;

-- Keep only English tweets.
C = FILTER B BY ( (chararray)tweetLanguage == 'en' );

D = ORDER C BY tweetId, tweetDateTime DESC;

-- Cast the fields, format the timestamp, classify the client, and strip newlines from the text.
E = FOREACH D GENERATE
    (long)tweetId,
    (chararray)tweetUser#'screen_name' AS screenName,
    REPLACE(REPLACE(ToString(ToDate((long)tweetDateTime)), 'T', ' '), 'Z', ''),
    CASE InString((chararray)tweetSource, 'TWITTER FOR ANDROID')
        WHEN 'True' THEN 'Twitter for Android'
        ELSE CASE InString((chararray)tweetSource, 'TWITTER FOR IPHONE')
            WHEN 'True' THEN 'Twitter for iPhone'
            ELSE CASE InString((chararray)tweetSource, 'TWITTER FOR WINDOWS')
                WHEN 'True' THEN 'Twitter for Windows'
                ELSE 'Browser'
            END
        END
    END,
    REPLACE((chararray)tweetComments, '\n', ''),
    (chararray)tweetUser#'friends_count' AS friendsCount,
    (chararray)tweetUser#'followers_count' AS followersCount,
    (chararray)tweetLanguage;

STORE E INTO '/tmp/jsonOutput/tweetsEng';
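The date expression in relation E takes the epoch-millisecond value from timestamp_ms, converts it to an ISO-8601 string, then replaces 'T' with a space and drops the trailing 'Z'. As an illustration only, here is roughly the same transformation in plain Java, assuming the time zone is UTC. The class and method names are made up for this sketch and are not part of the script or of any jar used above.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Illustrative only: mirrors the ToDate/ToString/REPLACE chain from the Pig script.
public class TweetTimeSketch {

    // Same shape as the script's output column, e.g. "yyyy-MM-dd HH:mm:ss.SSS" in UTC.
    private static final DateTimeFormatter FORMAT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS").withZone(ZoneOffset.UTC);

    // Parses an epoch-millisecond string (as found in 'timestamp_ms') and formats it.
    public static String format(String timestampMs) {
        long millis = Long.parseLong(timestampMs.trim());
        return FORMAT.format(Instant.ofEpochMilli(millis));
    }

    public static void main(String[] args) {
        System.out.println(format("0")); // prints 1970-01-01 00:00:00.000
    }
}
```

Formatting with an explicit pattern avoids the two REPLACE calls, at the cost of fixing the output layout in one place.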
Log statistics from the above script:
2015-12-20 17:50:17,082 [main] INFO org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics:

HadoopVersion	PigVersion	UserId	StartedAt	FinishedAt	Features
2.7.1.2.3.0.0-2557	0.15.0.2.3.0.0-2557	yarn	2015-12-20 17:46:13	2015-12-20 17:50:17	ORDER_BY,FILTER

Success!

Job Stats (time in seconds):
JobId	Maps	Reduces	MaxMapTime	MinMapTime	AvgMapTime	MedianMapTime	MaxReduceTime	MinReduceTime	AvgReduceTime	MedianReducetime	Alias	Feature	Outputs
job_1455173921266_0005	12	0	72	53	65	66	0	0	0	0	A,B,C	MAP_ONLY
job_1455173921266_0006	2	1	18	15	17	17	8	8	8	8	E	SAMPLER
job_1455173921266_0007	2	1	10	7	9	9	17	17	17	17	E,F	ORDER_BY	hdfs://sandbox.hortonworks.com:8020/tmp/jsonOutput/tweetsEng,

Input(s):
Successfully read 346871 records (1540293485 bytes) from: "/tmp/jsonInput/tweets.json"

Output(s):
Successfully stored 91168 records (16290719 bytes) in: "hdfs://sandbox.hortonworks.com:8020/tmp/jsonOutput/tweetsEng"

Counters:
Total records written : 91168
Total bytes written : 16290719
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Output of the above Pig script:
686906039862689792	rafie2012	2015-12-12 13:42:16.657	Browser	Mindfulness My Way: kindle free download this week only: https:\/\/t.co\/W00SlodXo3	1433	429	en
686906048251170816	ADCtrash	2015-12-12 13:42:18.657	Twitter for Android	RT @ravensheda: season 3 is all i can think about pleathe im not ready	79	171	en
686906039875301376	overcome_16	2015-12-12 13:42:16.660	Twitter for Android	Comeback de 9Muses\ud83d\ude0d	500	247	en
686906039883665408	slha28241	2015-12-12 13:42:16.662	Browser	Get Weather Updates from The Weather Channel. 08:42:16	40	4	en
686906039883698176	omega_soft	2015-12-12 13:42:16.662	Browser	Just beautiful https:\/\/t.co\/x4dZkLl9Zg	19	36	en
686906044090421248	daine_mariee	2015-12-12 13:42:17.665	Twitter for iPhone	RT @textposts: Please understand that I'm trying	201	180	en
Hope you like the post. You can post your comments/suggestions below.
What does findString.jar do? Can you share the code for that too? Please also show the input data.
Please use “string-1.1.0.jar” instead of “findString.jar”. It seems the jar file has been renamed. This jar is used to perform string operations on text.
You can try using the Twitter Elephant Bird JSON loader. It handles the JSON data dynamically.