I want to count how often each combination of 2 fields occurs in Pig:
------ y1 has the fields -----
a1 = GROUP y1 BY (user_id, tweet_created_at);
a2 = FOREACH a1 GENERATE group AS (user_id, tweet_created_at), COUNT(y1) AS number_of_replies_by_user;
a3 = FOREACH a2 GENERATE user_id, tweet_created_at, number_of_replies_by_user;
a4 = JOIN y1 BY (user_id, tweet_created_at) LEFT OUTER, a3 BY (user_id, tweet_created_at);
In the above, I want to count the frequency of each (user_id, tweet_created_at) field combination.
The line
a2 = FOREACH a1 GENERATE group AS (user_id, tweet_created_at), COUNT(y1) AS number_of_replies_by_user;
fails with: Incompatable schema: left is "user_id:NULL,tweet_created_at:NULL", right is "group:tuple(user_id:bytearray,tweet_created_at:bytearray)"
I also tried it without the parentheses: a2 = FOREACH a1 GENERATE group AS user_id, tweet_created_at, COUNT(y1) AS number_of_replies_by_user;
That gives a different error:
Invalid field projection. Projected field [tweet_created_at] does not exist in schema:..................
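Is this a syntax error or a problem with my data? If it is a syntax error, what is the correct way to write it?
In short: I want to count the number of replies a user had given as of each tweet's creation time. (If he posted 2 tweets on the same day, he might have 10 replies at the time of the first tweet and 15 at the time of the second.) I think that if I don't group by tweet_created_at, the reply count will always be a constant, which is wrong.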
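Answer 0 (score: 2)
When you GROUP BY more than one field, group is a single tuple field, which is why the two-name alias does not match its schema. Use FLATTEN on group to un-nest the tuple into separate fields: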
a2 = FOREACH a1 GENERATE FLATTEN(group) AS (user_id, tweet_created_at), COUNT(y1) AS number_of_replies_by_user;
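For context, here is a rough sketch of how the whole pipeline might look with that fix applied. The LOAD line is purely an assumption for illustration: the file name 'replies.tsv', the chararray types, and the reply_text column are hypothetical, since the question does not show the actual schema of y1. With FLATTEN in a2, the separate a3 projection from the question should no longer be needed.

```pig
-- Hypothetical input: path, delimiter, types and reply_text are assumptions,
-- not taken from the question. Only user_id and tweet_created_at are known.
y1 = LOAD 'replies.tsv' USING PigStorage('\t')
     AS (user_id:chararray, tweet_created_at:chararray, reply_text:chararray);

-- Grouping by two keys yields: (group:tuple(user_id, tweet_created_at), y1:bag)
a1 = GROUP y1 BY (user_id, tweet_created_at);

-- FLATTEN(group) expands the key tuple into two top-level fields,
-- so the two-name alias now matches the schema.
a2 = FOREACH a1 GENERATE
         FLATTEN(group) AS (user_id, tweet_created_at),
         COUNT(y1) AS number_of_replies_by_user;

-- Attach the per-(user, tweet time) count back onto every original row.
a4 = JOIN y1 BY (user_id, tweet_created_at) LEFT OUTER,
          a2 BY (user_id, tweet_created_at);

-- Print the schema to confirm the fields are no longer nested in a tuple.
DESCRIBE a2;
```

If you would rather not use FLATTEN, projecting the tuple members directly (group.user_id AS user_id, group.tweet_created_at AS tweet_created_at) should give the same flat fields.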