Filtering a list by a list of integers using Pig Latin

Time: 2017-07-14 17:44:47

Tags: filter apache-pig

I have a list that looks like this, lista.csv:

client-id    priority    client-start    assignment
12345        1            1250125125     13
1246         3            1250122156     27
12616        1            1250122351     3
...

I have another list that looks like a vector, listb.csv:

125125
124214
1246
125
...

What I want to do is filter lista down to the clients whose ID can also be found in listb.

I tried something like this, but it doesn't work:

raw = LOAD 'lista.csv' USING PigStorage('\t') AS (client-id: int, priority: int, client-start: int, assignment: int);
s4q = LOAD 'listb.csv' USING PigStorage('\t') AS (survs4id: int);
s4id = FOREACH s4q {
    dd = FILTER raw by (client-id == s4q);
    GENERATE dd;
}
DUMP dd;

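Any ideas on how to solve this?

1 Answer:

Answer 0: (score: 0)

Join the two relations to get only the matching records. An inner join keeps only the rows of lista whose ID also appears in listb, so it acts as the filter.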

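-- Pig field names cannot contain hyphens, so underscores are used here.
raw = LOAD 'lista.csv' USING PigStorage('\t') AS (client_id: int, priority: int, client_start: int, assignment: int);
s4q = LOAD 'listb.csv' USING PigStorage('\t') AS (survs4id: int);
-- Inner join: only rows of raw whose client_id appears in s4q survive.
s4id = JOIN raw BY client_id, s4q BY survs4id;
-- Keep the four columns that came from raw (positions 0-3) and drop the join key from s4q.
-- Inside FOREACH, use $0 rather than s4id.$0, which Pig would treat as a scalar projection.
dd = FOREACH s4id GENERATE $0, $1, $2, $3;
DUMP dd;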
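
One caveat with the join approach: if listb.csv can contain the same ID more than once, an inner join emits the matching lista row once per duplicate. Below is a minimal sketch of one way to guard against that, not a definitive implementation. It assumes underscore field names (client_id, client_start), since Pig identifiers cannot contain hyphens, and the alias names ids and joined are made up for illustration. The USING 'replicated' clause is optional and assumes the ID list is small enough to fit in memory; with a replicated join the small relation has to be listed last.

```pig
-- Sketch only: field names with underscores and the aliases ids/joined are assumptions.
raw = LOAD 'lista.csv' USING PigStorage('\t')
      AS (client_id: int, priority: int, client_start: int, assignment: int);
s4q = LOAD 'listb.csv' USING PigStorage('\t') AS (survs4id: int);

-- Remove repeated IDs so each client row can match at most once.
ids = DISTINCT s4q;

-- Inner join acting as a filter; 'replicated' assumes ids fits in memory.
joined = JOIN raw BY client_id, ids BY survs4id USING 'replicated';

-- Keep only the columns from raw, referenced by name via the :: disambiguation operator.
dd = FOREACH joined GENERATE raw::client_id   AS client_id,
                             raw::priority     AS priority,
                             raw::client_start AS client_start,
                             raw::assignment   AS assignment;
DUMP dd;
```

On the sample rows shown in the question, only client 1246 appears in both files, so that is the only visible row that would pass the filter. If the ID list is large, dropping USING 'replicated' falls back to the default join.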