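When I do joins in pig, I observe a violation of what I think should be a general invariant.
I would appreciate an explanation of what I am doing, or thinking, wrong.
I have a table (an alias, in pig terms)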
user_action = distinct (foreach user_action generate action, user);
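which lists the users who participated in some actions. Note that distinct guarantees that the pair (action, user) indexes the alias, i.e. each pair appears at most once.
I have another alias that tells me how many times people thought about taking an action: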
user_thoughts = foreach (group A by (action, user)) generate
group.action as action, group.user as user, COUNT(A) as tcount;
Now I join the thoughts to the actions:
thought_relevance_per_user = foreach (join user_action by (user, action) left,
user_thoughts by (user, action)) generate
user_action::user as user, user_action::action as action,
(user_thoughts::tcount is NULL ? 0L : user_thoughts::tcount) as tcount;
thought_relevance = foreach (group thought_relevance_per_user
by (action, tcount)) generate
group.action as action, group.tcount as tcount,
COUNT(thought_relevance_per_user) as ucount;
I expect the number of users participating in each action, computed like this:
user_counts = foreach (group user_action by action) generate
group as action, COUNT(user_action) as ucount;
and like this:
user_counts = foreach (group thought_relevance by action) generate
group as action, SUM(thought_relevance.ucount) as ucount;
to be identical.
They are not: the second is 10 times smaller than the first.
(I am doing the user_counts calculations off-line in R, so the pig syntax above may be wrong.)
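To make the expected invariant concrete, here is a minimal sketch in plain Python (not pig) that mirrors the pipeline above on a few rows. The toy data and the names thoughts, direct_counts and summed_counts are made up for illustration and do not come from my real job. It only checks the logic: a left join against a right side with unique keys keeps every left row exactly once, so both ways of counting users should agree.

```python
from collections import Counter

# Hypothetical toy data; none of these values come from the real job.
# user_action: distinct (action, user) pairs, as produced by `distinct`.
user_action = {("click", "u1"), ("click", "u2"), ("buy", "u1"), ("buy", "u3")}

# Raw "thought" events: one tuple per time a user thought about an action.
thoughts = [("click", "u1"), ("click", "u1"), ("buy", "u3"), ("view", "u9")]

# user_thoughts: one count per (action, user), so the join key is unique.
user_thoughts = Counter(thoughts)

# Left outer join: every row of user_action survives exactly once,
# with tcount defaulting to 0 when there is no matching thought row.
thought_relevance_per_user = [
    (action, user, user_thoughts.get((action, user), 0))
    for (action, user) in user_action
]

# Group by (action, tcount) and count users in each group.
thought_relevance = Counter(
    (action, tcount) for action, _user, tcount in thought_relevance_per_user
)

# First way: count users per action directly from user_action.
direct_counts = Counter(action for action, _user in user_action)

# Second way: sum ucount over all tcount values within each action.
summed_counts = Counter()
for (action, _tcount), ucount in thought_relevance.items():
    summed_counts[action] += ucount

print(direct_counts == summed_counts)
```

If the two counts disagree on real data, this suggests looking at the inputs to the join (duplicate or damaged keys) rather than at the grouping logic.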
Why is that? Is my code wrong? Are my expectations wrong?
Answer 0 (score: 0)
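The code is correct. The reason I was seeing garbage is that I had stored the aliases before the join, and that damaged them. After I loaded them again, I got the correct behavior.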