在Pig Latin中仅加载复杂JSON对象中的几个值

时间:2015-01-12 10:42:20

标签: json apache-pig

我有一个复杂的JSON文件,如下所示:http://pastebin.com/4UfadbqS

我想使用Pig Latin从这些JSON对象中仅加载几个值。我试着这样做:

mydata = LOAD 'data.json' 
    USING JsonLoader('id:chararray, created_at:chararray, 
                      user: {(language:chararray)}’);
STORE mydata 
    INTO 'output';

但似乎Pig Latin只是从JSON中取出前3个值并保存它们(它不会将列名识别为键)。有没有办法实现这个目标?或者,我应该只列出Pig中JSON的所有值并在之后过滤它们吗?

1 个答案:

答案 0 :(得分:2)

上述方法中存在的问题很少 1. JsonLoader将始终期望输入的完整模式,但您只给出了三个字段 2. JsonLoader总是希望整个输入为单行,但输入是多行的 3. JsonLoader不支持嵌套模式,但您的输入包含嵌套模式。

要解决上述所有问题,请使用第三方库elephant-bird jar 从此链接下载( elephant-bird-pig-4.1.jar和elephant-bird-hadoop-compat-4.1.jar )jar文件 http://www.java2s.com/Code/Jar/e/elephant.htm并尝试以下方法

我复制了您的整个输入并将其格式化为单行,如下所示。

<强> input.json

{"filter_level":"medium","retweeted":false,"in_reply_to_screen_name":null,"possibly_sensitive":false,"truncated":false,"lang":"en","in_reply_to_status_id_str":null,"id":488927960280211456,"in_reply_to_user_id_str":null,"in_reply_to_status_id":null,"created_at":"Tue Jul 15 06:08:04 +0000 2014","favorite_count":0,"place":null,"coordinates":null,"text":"RT @BulleyBufton: @MinaANDMaya PLEASE RT /VOTE BULLEY. Last day to help me win my old rescue @HilbraesDogs £5k https://t.co/Y8g47fLYY1 http\u2026","contributors":null,"retweeted_stt
atus":{"filter_level":"low","contributors":null,"text":"@MinaANDMaya PLEASE RT /VOTE BULLEY. Last day to help me win my old rescue @HilbraesDogs £5k https://t.co/Y8g47fLYY1 httpp
://t.co/DDco9wVXtP","geo":null,"retweeted":false,"in_reply_to_screen_name":"MinaANDMaya","possibly_sensitive":false,"truncated":false,"lang":"en","entities":{"trends":[],"symbols":[],"urls":[{"expanded_url":"https://www.animalfriendsquote.co.uk/fb-worldcup/","indices":[93,116],"display_url":"animalfriendsquote.co.uk/fb-worldcup/","url":"https://t.co/Y8g47fLYY1"}],"hashtags":[],"media":[{"sizes":{"thumb":{"w":150,"resize":"crop","h":150},"small":{"w":340,"resize":"fit","h":455},"large":{"w":706,"resize":"fit","h":946},"medium":{"w":600,"resize":"fit","h":803}},"id":488926730481332224,"media_url_https":"https://pbs.twimg.com/media/BskERVuIcAAJZGu.jpg","media_url":"http://pbs.twimg.com/media/BskERVuIcAAJZGu.jpg","expanded_url":"http://twitter.com/BulleyBufton/status/488926827394904064/photo/1","indices":[117,139],"id_str":"488926730481332224","type":"photo","display_url":"pic.twitter.com/DDco9wVXtP","url":"http://t.co/DDco9wVXtP"}],"user_mentions":[{"id":132204038,"name":"Mina*Bad Yoga Kitty*","indices":[0,12],"screen_name":"MinaANDMaya","id_str":"132204038"},{"id":2308374684,"name":"Julianna Kaminski","indices":[75,88],"screen_name":"HilbraesDogs","id_str":"2308374684"}]},"in_reply_to_status_id_str":null,"id":488926827394904064,"source":"<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android<\/a>","in_reply_to_user_id_str":"132204038","favorited":false,"in_reply_to_status_id":null,"retweet_count":6,"created_at":"Tue Jul 15 06:03:34 +0000 2014","in_reply_to_user_id":132204038,"favorite_count":3,"id_str":"488926827394904064","place":null,"user":{"location":"CHICAGO , USA","default_profile":false,"statuses_count":8868,"profile_background_tile":true,"lang":"en","profile_link_color":"AD54E8","profile_banner_url":"https://pbs.twimg.com/profile_banners/225136520/1403608773","id":225136520,"following":null,"favourites_count":5082,"protected":false,"profile_text_color":"3D1957","verified":false,"description":"I'm Bulley, I'm proof that there is always hope.\r\nI was in rescue kennels in UK for 9yrs. @ada_bscakes took me in.\r\nWe've moved to America to start a new life.","contributors_enabled":false,"profile_sidebar_border_color":"000000","name":"BULLEY","profile_background_color":"0A0A0A","created_at":"Fri Dec 10 19:55:17 +0000 2010","default_profile_image":false,"followers_count":3421,"profile_image_url_https":"https://pbs.twimg.com/profile_images/486614595457789952/gtcLac9w_normal.jpeg","geo_enabled":true,"profile_background_image_url":"http://pbs.twimg.com/profile_background_images/378800000166829702/isbjd7O4.jpeg","profile_background_image_url_https":"https://pbs.twimg.com/profile_background_images/378800000166829702/isbjd7O4.jpeg","follow_request_sent":null,"url":null,"utc_offset":-39600,"time_zone":"International Date Line West","notifications":null,"profile_use_background_image":true,"friends_count":3702,"profile_sidebar_fill_color":"7AC3EE","screen_name":"BulleyBufton","id_str":"225136520","profile_image_url":"http://pbs.twimg.com/profile_images/486614595457789952/gtcLac9w_normal.jpeg","listed_count":29,"is_translator":false},"coordinates":null},"geo":null,"entities":{"trends":[],"symbols":[],"urls":[{"expanded_url":"https://www.animalfriendsquote.co.uk/fb-worldcup/","indices":[111,134],"display_url":"animalfriendsquote.co.uk/fb-worldcup/","url":"https://t.co/Y8g47fLYY1"}],"hashtags":[],"media":[{"sizes":{"thumb":{"w":150,"resize":"crop","h":150},"small":{"w":340,"resize":"fit","h":455},"large":{"w":706,"resize":"fit","h":946},"medium":{"w":600,"resize":"fit","h":803}},"id":488926730481332224,"media_url_https":"https://pbs.twimg.com/media/BskERVuIcAAJZGu.jpg","media_url":"http://pbs.twimg.com/media/BskERVuIcAAJZGu.jpg","expanded_url":"http://twitter.com/BulleyBufton/status/488926827394904064/photo/1","source_status_id_str":"488926827394904064","indices":[139,140],"source_status_id":488926827394904064,"id_str":"488926730481332224","type":"photo","display_url":"pic.twitter.com/DDco9wVXtP","url":"http://t.co/DDco9wVXtP"}],"user_mentions":[{"id":225136520,"name":"BULLEY","indices":[3,16],"screen_name":"BulleyBufton","id_str":"225136520"},{"id":132204038,"name":"Mina*Bad Yoga Kitty*","indices":[18,30],"screen_name":"MinaANDMaya","id_str":"132204038"},{"id":2308374684,"name":"Julianna Kaminski","indices":[93,106],"screen_name":"HilbraesDogs","id_str":"2308374684"}]},"source":"<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android<\/a>","favorited":false,"in_reply_to_user_id":null,"retweet_count":0,"id_str":"488927960280211456","user":{"location":"","default_profile":false,"statuses_count":1370,"profile_background_tile":true,"lang":"zh-tw","profile_link_color":"038544","profile_banner_url":"https://pbs.twimg.com/profile_banners/2272804116/1404662156","id":2272804116,"following":null,"favourites_count":2000,"protected":false,"profile_text_color":"333333","verified":false,"description":"No More Sorrow","contributors_enabled":false,"profile_sidebar_border_color":"000000","name":"Winnie","profile_background_color":"14DBBA","created_at":"Thu Jan 02 10:13:01 +0000 2014","default_profile_image":false,"followers_count":311,"profile_image_url_https":"https://pbs.twimg.com/profile_images/478106512083017728/4ao_8JjE_normal.jpeg","geo_enabled":false,"profile_background_image_url":"http://pbs.twimg.com/profile_background_images/431815421189029888/YrRNpUfd.jpeg","profile_background_image_url_https":"https://pbs.twimg.com/profile_background_images/431815421189029888/YrRNpUfd.jpeg","follow_request_sent":null,"url":null,"utc_offset":null,"time_zone":null,"notifications":null,"profile_use_background_image":true,"friends_count":455,"profile_sidebar_fill_color":"DDEEF6","screen_name":"winnie341881","id_str":"2272804116","profile_image_url":"http://pbs.twimg.com/profile_images/478106512083017728/4ao_8JjE_normal.jpeg","listed_count":0,"is_translator":false}}

<强> PigScript:

REGISTER '/tmp/elephant-bird-hadoop-compat-4.1.jar';
REGISTER '/tmp/elephant-bird-pig-4.1.jar';

A = LOAD 'input.json ' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS myMap;
B = FOREACH A GENERATE myMap#'id' AS ID,myMap#'created_at' AS createdAT,myMap#'user' AS User;
DUMP B;

<强>输出:

(488927960280211456,Tue Jul 15 06:08:04 +0000 2014,[location#,default_profile#false,profile_background_tile#true,statuses_count#1370,lang#zh-tw,profile_link_color#038544,profile_banner_url#https://pbs.twimg.com/profile_banners/2272804116/1404662156,id#2272804116,following#,protected#false,favourites_count#2000,profile_text_color#333333,contributors_enabled#false,description#No More Sorrow,verified#false,name#Winnie,profile_sidebar_border_color#000000,profile_background_color#14DBBA,created_at#Thu Jan 02 10:13:01 +0000 2014,default_profile_image#false,followers_count#311,geo_enabled#false,profile_image_url_https#https://pbs.twimg.com/profile_images/478106512083017728/4ao_8JjE_normal.jpeg,profile_background_image_url#http://pbs.twimg.com/profile_background_images/431815421189029888/YrRNpUfd.jpeg,profile_background_image_url_https#https://pbs.twimg.com/profile_background_images/431815421189029888/YrRNpUfd.jpeg,follow_request_sent#,url#,utc_offset#,time_zone#,notifications#,friends_count#455,profile_use_background_image#true,profile_sidebar_fill_color#DDEEF6,screen_name#winnie341881,id_str#2272804116,profile_image_url#http://pbs.twimg.com/profile_images/478106512083017728/4ao_8JjE_normal.jpeg,is_translator#false,listed_count#0])

elephantbird library中,所有值都为stored as key/value pair(ie MAP datatype),因此很容易从加载的数据中提取必填字段。

在上面的脚本中,我根据您的需要提取了'id','created_at' and 'user'的值。 假设您要从用户&#39;中提取一些字段。数据(ex: 'friends_count' and 'followers_count'),在这种情况下,您需要投影'user'字段并提取所需的数据。示例代码如下。

<强> PigScript:

REGISTER '/tmp/elephant-bird-hadoop-compat-4.1.jar';
REGISTER '/tmp/elephant-bird-pig-4.1.jar';

A = LOAD 'input.json ' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS myMap;
B = FOREACH A GENERATE 'user' AS User;
C = FOREACH B GENERATE User#'friends_count', User#'followers_count';
DUMP C;

<强>输出:

(455,311)