pig - How do I reference columns in a FOREACH after a JOIN?

Time: 2011-11-08 13:32:36

Tags: apache-pig

A = load 'a.txt' as (id, a1);
B = load 'b.txt' as (id, b1);
C = join A by id, B by id;
D = foreach C generate id,a1,b1;
dump D;

The 4th line fails with: Invalid field projection. Projected field [id] does not exist in schema

I tried changing it to A.id, but then the last line fails with: ERROR 0: Scalar has more than one row in the output.

2 answers:

Answer 0 (score: 45)

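What you're looking for is the "Disambiguate Operator". You want A::id, not A.id.

A.id says "there is a relation/bag A, and there is a column called id in its schema".

A::id says "there is a record that came from A, and it has a column called id".

So you would do: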

A = load 'a.txt' as (id, a1);
B = load 'b.txt' as (id, b1);
C = join A by id, B by id;
D = foreach C generate A::id,a1,b1;
dump D;
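
If it is unclear which prefixes a join produced, inspecting the joined schema can help. The following is a minimal sketch, assuming the same a.txt and b.txt as in the question; the "as id" rename is an optional addition for illustration, not part of the original answer.

```pig
A = load 'a.txt' as (id, a1);
B = load 'b.txt' as (id, b1);
C = join A by id, B by id;
-- Print the schema of C. Fields coming from each input carry the
-- relation name as a prefix: A::id, A::a1, B::id, B::b1.
describe C;
-- id exists in both inputs, so it must be disambiguated with A:: (or B::).
-- a1 and b1 are unique in C, so the prefix can be omitted for them.
-- The optional "as id" rename lets later statements refer to plain id.
D = foreach C generate A::id as id, a1, b1;
dump D;
```

Renaming in the FOREACH keeps the prefixes from piling up when D is joined again later.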

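A dirty alternative:

Just because I'm lazy, and because disambiguation gets really weird when you start doing multiple joins one after another: use unique identifiers.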

A = load 'a.txt' as (ida, a1);
B = load 'b.txt' as (idb, b1);
C = join A by ida, B by idb;
D = foreach C generate ida,a1,b1;
dump D;

Answer 1 (score: 0)

@nweiler: If you know the first and last fields of relation A, you can write the following:

     D = FOREACH C GENERATE A::FirstCol..A::LastCol;

This gives you all the columns from FirstCol through LastCol.
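
Applied to the relations from the question, a range projection might look like the sketch below. It assumes a Pig version that supports range projection (this was added in a later release than the one the question dates from, so check your version) and reuses the question's a.txt and b.txt; the alias E is introduced here only for illustration.

```pig
A = load 'a.txt' as (id, a1);
B = load 'b.txt' as (id, b1);
C = join A by id, B by id;
-- Project every column of C from A::id through A::a1,
-- which here covers all of A's fields.
D = foreach C generate A::id..A::a1;
-- A range can be combined with individual fields, e.g. adding b1 from B.
E = foreach C generate A::id..A::a1, b1;
dump E;
```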