从JsonLoader Apache Pig获取不正确的值

时间:2014-09-08 05:16:22

标签: apache-pig

我最近一直在使用Apache Pig。我想根据yelp中的数据集提取几列。请仔细查看我使用的代码。我尝试在Hortonworks平台以及我的机器(Ubuntu)中运行它们。我得到与不同列对应的结果作为输出。请指出我犯错的地方..

查询:

grunt> business = load 'yelp_academic_dataset_business.json' 
          using JsonLoader('name:chararray, state:chararray');

grunt> business_name = foreach business generate name, state;                                                    
grunt> toPrint = limit business_name 5;                                                                          
grunt> dump toPrint; 

输出:

(5AJdS8LYpCgzfOwGaEqZkA,14362 N Frank Lloyd Wright Blvd Ste B104 Scottsdale,AZ 85260) (6UXw7_U13Th0PZlMXZbjMg,位于内华达州拉斯维加斯东南D1大门对面的麦卡伦机场) (80VmGCy6UcYYCKC_BONZTQ,524 N 92nd St Scottsdale,AZ 85256) (95p9Xg358BezJyk1wqzzyg,5114 Farwell St Mc Farland,WI 53558) (EkhrRWzevfFJc8Pm2dVPEA,140 University Water W Waterloo,ON N2L 3W6)

文件中的示例输入:

{
   "business_id": "vcNAWiLM4dR7D2nwwJ7nCA", 
   "full_address": "4840 E Indian School Rd\nSte 101\nPhoenix, AZ 85018", 
   "hours": {
            "Tuesday": {"close": "17:00", "open": "08:00"}, 
            "Friday": {"close": "17:00", "open": "08:00"}, 
            "Monday": {"close": "17:00", "open": "08:00"}, 
            "Wednesday": {"close": "17:00", "open": "08:00"},
            "Thursday": {"close": "17:00", "open": "08:00"}
            },
    "open": true,
    "categories": ["Doctors", "Health & Medical"], 
    "city": "Phoenix", "review_count": 7,
    "name": "Eric Goldberg, MD",
    "neighborhoods": [], 
    "longitude": -111.98375799999999,
    "state": "AZ",
    "stars": 3.5,
    "latitude": 33.499313000000001,
    "attributes": {"By Appointment Only": true},
    "type": "business"
}

编辑2:

我还将elephant-bird-2.2.3.jar文件复制到/ hadoop / bin文件夹中。在那个文件夹中,我打电话给#34; pig -x local"以本地模式启动PIG。一旦启动,我会进行elphant-bird-2.2.3.jar的注册,然后继续查询。

包含elephant-jar之后:

grunt> register elphant-bird-2.2.3.jar;
grunt> business = load 'yelp_academic_dataset_business.json' 
              using JsonLoader('name:chararray, state:chararray');

grunt> business_name = foreach business generate name, state;                                                    
grunt> toPrint = limit business_name 5;                                                                          
grunt> dump toPrint; 

1 个答案:

答案 0 :(得分:0)

对于嵌套的json,您可以使用以下脚本加载它 -

register elephant-bird-pig-4.4.jar;
register elephant-bird-core-4.4.jar;
register elephant-bird-hadoop-compat-4.4.jar;
register google-collections-1.0-rc1.jar;
register json_simple-1.1.jar;

business = load 'yelp_academic_dataset_business.json' using com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');
dump business;

或者,您也可以将其加载为 -

business = load 'pigJsontest.txt' using com.twitter.elephantbird.pig.load.JsonLoader() as (json:map[]);

这些脚本中的任何一个都会以地图格式[密钥,值格式]加载json数据,请点击此处 - http://pig.apache.org/docs/r0.11.1/basic.html#map-schema

您需要通过迭代此地图来访问特定元素。例如 -

id_state = foreach business generate (CHARARRAY)$0#'business_id' as business_id, (CHARARRAY)$0#'state' as state;
dump id_state;