我最近一直在使用Apache Pig。我想根据yelp中的数据集提取几列。请仔细查看我使用的代码。我尝试在Hortonworks平台以及我的机器(Ubuntu)中运行它们。我得到与不同列对应的结果作为输出。请指出我犯错的地方..
查询:
grunt> business = load 'yelp_academic_dataset_business.json'
using JsonLoader('name:chararray, state:chararray');
grunt> business_name = foreach business generate name, state;
grunt> toPrint = limit business_name 5;
grunt> dump toPrint;
输出:
(5AJdS8LYpCgzfOwGaEqZkA,14362 N Frank Lloyd Wright Blvd Ste B104 Scottsdale,AZ 85260) (6UXw7_U13Th0PZlMXZbjMg,位于内华达州拉斯维加斯东南D1大门对面的麦卡伦机场) (80VmGCy6UcYYCKC_BONZTQ,524 N 92nd St Scottsdale,AZ 85256) (95p9Xg358BezJyk1wqzzyg,5114 Farwell St Mc Farland,WI 53558) (EkhrRWzevfFJc8Pm2dVPEA,140 University Water W Waterloo,ON N2L 3W6)
文件中的示例输入:
{
"business_id": "vcNAWiLM4dR7D2nwwJ7nCA",
"full_address": "4840 E Indian School Rd\nSte 101\nPhoenix, AZ 85018",
"hours": {
"Tuesday": {"close": "17:00", "open": "08:00"},
"Friday": {"close": "17:00", "open": "08:00"},
"Monday": {"close": "17:00", "open": "08:00"},
"Wednesday": {"close": "17:00", "open": "08:00"},
"Thursday": {"close": "17:00", "open": "08:00"}
},
"open": true,
"categories": ["Doctors", "Health & Medical"],
"city": "Phoenix", "review_count": 7,
"name": "Eric Goldberg, MD",
"neighborhoods": [],
"longitude": -111.98375799999999,
"state": "AZ",
"stars": 3.5,
"latitude": 33.499313000000001,
"attributes": {"By Appointment Only": true},
"type": "business"
}
编辑2:
我还将elephant-bird-2.2.3.jar文件复制到/ hadoop / bin文件夹中。在那个文件夹中,我打电话给#34; pig -x local"以本地模式启动PIG。一旦启动,我会进行elphant-bird-2.2.3.jar的注册,然后继续查询。
包含elephant-jar之后:
grunt> register elphant-bird-2.2.3.jar;
grunt> business = load 'yelp_academic_dataset_business.json'
using JsonLoader('name:chararray, state:chararray');
grunt> business_name = foreach business generate name, state;
grunt> toPrint = limit business_name 5;
grunt> dump toPrint;
答案 0 :(得分:0)
对于嵌套的json,您可以使用以下脚本加载它 -
register elephant-bird-pig-4.4.jar;
register elephant-bird-core-4.4.jar;
register elephant-bird-hadoop-compat-4.4.jar;
register google-collections-1.0-rc1.jar;
register json_simple-1.1.jar;
business = load 'yelp_academic_dataset_business.json' using com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');
dump business;
或者,您也可以将其加载为 -
business = load 'pigJsontest.txt' using com.twitter.elephantbird.pig.load.JsonLoader() as (json:map[]);
这些脚本中的任何一个都会以地图格式[密钥,值格式]加载json数据,请点击此处 - http://pig.apache.org/docs/r0.11.1/basic.html#map-schema
您需要通过迭代此地图来访问特定元素。例如 -
id_state = foreach business generate (CHARARRAY)$0#'business_id' as business_id, (CHARARRAY)$0#'state' as state;
dump id_state;