I'm trying to teach myself Pig, and I have the following script:
customer_ratings = LOAD 'customer_ratings.txt' as (i_id:int, customer_id:int, rating:int);
item_data = LOAD 'item_data.txt' USING PigStorage(',') as (item_id:int,item_name:chararray, dummy:int,item_url:chararray);
item_join = join item_data by item_id, customer_ratings by i_id;
item_group = GROUP item_join ALL;
item_foreach = foreach item_group generate item_id, item_name, item_url, AVG(item_join.rating);
PRINT = limit item_foreach 40;
dump PRINT;
The foreach fails with the following error:
Invalid field projection. Projected field [item_id] does not exist in schema: group:chararray,item_join:bag{:tuple(item_data::item_id:int,item_data::item_name:chararray,item_data::dummy:int,item_data::item_url:chararray,customer_ratings::i_id:int,customer_ratings::customer_id:int,customer_ratings::rating:int)}.
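I know there is something I'm not picking up from the tutorials to make this work... any ideas how to print what I have in the foreach?
I also tried generate item_data::item_id, item_data::item_name, etc., following the instructions in "pig - how to reference columns in a FOREACH after a JOIN?", but that didn't work either...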
Answer 0 (score: 2)
customer_ratings = LOAD 'customer_ratings.txt' as (i_id:int,customer_id:int, rating:int);
item_data = LOAD 'item_data.txt' USING PigStorage(',') as (item_id:int,item_name:chararray, dummy:int,item_url:chararray);
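-- Rename the joined fields so the relation:: prefixes are not needed later.
-- item_url is projected here because the GROUP below uses it as a key.
item_join = foreach (
    join item_data by item_id,
         customer_ratings by i_id
)
generate
    item_data::item_id as item_id,
    item_data::item_name as item_name,
    item_data::item_url as item_url,
    customer_ratings::rating as rating
;
-- Group by the item keys instead of ALL so each item gets its own average.
item_group = GROUP item_join by (item_id, item_url);
-- FLATTEN(group) turns the key tuple back into top-level fields.
item_foreach = foreach item_group generate
    FLATTEN(group) as (item_id, item_url),
    AVG(item_join.rating)
;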
PRINT = limit item_foreach 40;
dump PRINT;
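I think something like this should work, although I haven't tested it. I did two things. First, after the join, I renamed the fields to simple names so we don't have to carry around a bunch of fields named relation::fieldname.
Second, flattening the group is an easier way to get the keys back out of the group. In your example, I think you would need to use something like generate item_join.item_data::item_id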
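To see why the original foreach could not find item_id, it can help to inspect the schema at each step with DESCRIBE. The following is a minimal, untested sketch that reuses the file names and schemas from the question; the aliases joined, grouped_all and overall_avg are made up for illustration.

```pig
-- Same inputs as in the question (file names and schemas taken from there).
customer_ratings = LOAD 'customer_ratings.txt' AS (i_id:int, customer_id:int, rating:int);
item_data = LOAD 'item_data.txt' USING PigStorage(',') AS (item_id:int, item_name:chararray, dummy:int, item_url:chararray);

-- After a JOIN, each field keeps a prefix naming the relation it came from.
joined = JOIN item_data BY item_id, customer_ratings BY i_id;
DESCRIBE joined;

-- GROUP ... ALL produces a single row with two fields: group, and a bag
-- named after the input relation (here: joined). item_id is no longer a
-- top-level field, which matches the schema shown in the error message.
grouped_all = GROUP joined ALL;
DESCRIBE grouped_all;

-- Fields inside the bag are reached through the bag name.
-- With GROUP ALL this gives one average over every rating, not one per item.
overall_avg = FOREACH grouped_all GENERATE AVG(joined.customer_ratings::rating) AS avg_rating;
DUMP overall_avg;
```

Because GROUP ... ALL collapses everything into one bag, a per-item average needs GROUP ... BY on the item keys, as in the script in this answer.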