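Pig: Calculating the frequency of multiple columns

Time: 2016-10-26 16:22:26

Tags: apache-pig

I want to calculate the frequency of a combination of two fields in Pig: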

------ y1 has the fields -----
a1 = GROUP y1 BY (user_id, tweet_created_at);
a2 = FOREACH a1 GENERATE group AS (user_id, tweet_created_at), COUNT(y1) AS number_of_replies_by_user;
a3 = FOREACH a2 GENERATE user_id, tweet_created_at, number_of_replies_by_user;
a4 = JOIN y1 BY (user_id, tweet_created_at) LEFT OUTER, a3 BY (user_id, tweet_created_at);

In the code above, I want to calculate the frequency of each (user_id, tweet_created_at) combination.

The line a2 = FOREACH a1 GENERATE group AS (user_id, tweet_created_at), COUNT(y1) AS number_of_replies_by_user; fails with the error: Incompatable schema: left is "user_id:NULL,tweet_created_at:NULL", right is "group:tuple(user_id:bytearray,tweet_created_at:bytearray)"

I also tried it without the parentheses: a2 = FOREACH a1 GENERATE group AS user_id, tweet_created_at, COUNT(y1) AS number_of_replies_by_user;

That gives a different error:

Invalid field projection. Projected field [tweet_created_at] does not exist in schema:..................

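Is this a syntax error or a problem with my data? If it is a syntax error, what is the correct way to write it?

In short: I want to count the number of replies a user had made as of the time each tweet was posted. (If he posted 2 tweets on the same day, he might have 10 replies at the time of the first tweet and 15 at the time of the second.) I think that if I do not group by tweet_created_at, the reply count will always be a constant, which is wrong.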

1 answer:

Answer 0: (score: 2)

Use FLATTEN on group to un-nest the tuple into separate fields:

a2 = FOREACH a1 GENERATE FLATTEN(group) AS (user_id, tweet_created_at), COUNT(y1) AS number_of_replies_by_user;
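
Some background on why both attempts fail: after grouping by more than one key, a1 has only two fields, group (a tuple holding user_id and tweet_created_at) and y1 (a bag). A bare AS (user_id, tweet_created_at) tries to give a single tuple field a two-field schema, and in the version without parentheses the AS applies only to group, so tweet_created_at is treated as a separate projection from a1, where no such field exists. Below is a minimal sketch of the whole pipeline with the fix applied. The LOAD path 'replies.tsv' and the two-field chararray schema are hypothetical, since the question does not show how y1 is loaded.

```pig
-- Hypothetical input: the real path and schema of y1 are not given in the question.
y1 = LOAD 'replies.tsv' USING PigStorage('\t')
     AS (user_id:chararray, tweet_created_at:chararray);

-- Grouping by two keys produces a relation with the fields (group, y1),
-- where group is a tuple (user_id, tweet_created_at).
a1 = GROUP y1 BY (user_id, tweet_created_at);
DESCRIBE a1;

-- FLATTEN un-nests the group tuple into two top-level fields,
-- so the two-name AS clause now matches the schema.
a2 = FOREACH a1 GENERATE
         FLATTEN(group) AS (user_id, tweet_created_at),
         COUNT(y1) AS number_of_replies_by_user;

-- a2 is already flat, so the intermediate a3 projection is no longer needed.
a4 = JOIN y1 BY (user_id, tweet_created_at) LEFT OUTER,
          a2 BY (user_id, tweet_created_at);
```

An equivalent form without FLATTEN projects the tuple members directly: group.user_id AS user_id, group.tweet_created_at AS tweet_created_at. Note that COUNT skips tuples whose first field is null; COUNT_STAR counts every tuple if that matters for the data. After the join, the duplicated column names are disambiguated with the relation prefix, for example y1::user_id and a2::user_id.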