Join invariant violation

Time: 2013-12-09 22:45:08

Tags: join apache-pig

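When I join in Pig, I observe a violation of what I think should be a general invariant. I would appreciate an explanation of what I am doing, or thinking, wrong.

I have a table (an alias, in Pig terminology):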

user_action = distinct (foreach user_action generate action, user);

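which lists the users who engaged in some action. Note that distinct guarantees that (action, user) indexes the alias, i.e. each (action, user) pair appears at most once.

I have another alias telling me how many times people thought about an action: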

user_thoughts = foreach (group A by (action, user)) generate
  group.action as action, group.user as user, COUNT(A) as tcount;

Now I join actions to thoughts:

thought_relevance_per_user = foreach (join user_action by (user, action) left,
  user_thoughts by (user, action)) generate
  user_action::user as user, user_action::action as action,
  (user_thoughts::tcount is NULL ? 0L : user_thoughts::tcount) as tcount;
thought_relevance = foreach (group thought_relevance_per_user
  by (action, tcount)) generate
  group.action as action, group.tcount as tcount,
  COUNT(thought_relevance_per_user) as ucount;

I expect the number of users who engaged in an action, computed like this:

user_counts = foreach (group user_action by action) generate
  group as action, COUNT(user_action) as ucount;

and like this:

user_counts = foreach (group thought_relevance by action) generate
  group as action, SUM(thought_relevance.ucount) as ucount;

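to be identical.

They are not: the second is 10 times higher than the first.

(I compute user_counts offline in R, so the Pig syntax above may be wrong.)

Why? Is my code wrong? Is my expectation wrong?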

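1 answer:

Answer 0: (score: 0)

The code is correct. The reason I was seeing junk was that I had stored the aliases before the join and corrupted them. After I loaded them again, I got the correct behavior.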
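
As a sanity check on the expectation itself, here is a minimal Python sketch (not Pig) of the same pipeline on made-up data. The function names `left_join_counts` and `direct_counts` and all sample rows are assumptions for illustration only. It suggests that the two per-action user counts agree as long as (action, user) is unique on the right-hand side of the left join. Duplicated right-hand rows are shown as one hypothetical way the counts could inflate; the thread does not say how the stored aliases were actually corrupted.

```python
from collections import Counter, defaultdict


def direct_counts(user_action):
    """Users per action, counted straight from the distinct (action, user) pairs."""
    return dict(Counter(action for action, _ in user_action))


def left_join_counts(user_action, user_thoughts):
    """Mimic the Pig pipeline: left join on (action, user), group by
    (action, tcount) to get ucount, then sum ucount per action."""
    # Index the right-hand side by key; a key maps to several rows if it is not unique.
    by_key = defaultdict(list)
    for action, user, tcount in user_thoughts:
        by_key[(action, user)].append(tcount)

    # Left join: every left row survives; unmatched rows get tcount 0.
    joined = []
    for action, user in user_action:
        for tcount in by_key.get((action, user), [0]):
            joined.append((action, user, tcount))

    # group by (action, tcount) -> ucount, then sum ucount per action.
    relevance = Counter((action, tcount) for action, _, tcount in joined)
    totals = Counter()
    for (action, _), ucount in relevance.items():
        totals[action] += ucount
    return dict(totals)


# Made-up sample data (hypothetical, not from the question).
user_action = sorted({("buy", "alice"), ("buy", "bob"), ("sell", "alice")})
user_thoughts = [("buy", "alice", 3), ("sell", "carol", 1)]

# With a unique right-hand key, both computations agree.
assert left_join_counts(user_action, user_thoughts) == direct_counts(user_action)

# Hypothetical corruption: the same right-hand row repeated, so the key is no longer unique.
corrupted = user_thoughts + [("buy", "alice", 3)] * 9
print(direct_counts(user_action))
print(left_join_counts(user_action, corrupted))
```

If the two counts disagree in a real job, checking both join inputs for duplicate (action, user) keys just before the join is a cheap first diagnostic.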