Creating a huge filter in Pig

Time: 2017-04-21 02:18:17

Tags: hadoop apache-pig

I have this code.

large = load 'a super large file';

CC = FILTER large BY $19 == 'abc' OR $20 == 'abc'
OR $19 == 'def' OR $20 == 'def' ....;

The number of OR conditions can run into the hundreds or even thousands.

Is there a better way to do this?

1 answer:

Answer 0 (score: 2)

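Yes. Put those conditions in a separate file, load that file into a relation, and join the two relations on the column. If you have to filter on multiple columns, create as many filter files as there are conditions. Below is an example for 2 columns.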

large = load 'a super large file';
filter1 = load 'file with values needed to compare with $19';
filter2 = load 'file with values needed to compare with $20';
f1 = JOIN large BY $19,filter1 BY $0;
f2 = JOIN large BY $20,filter2 BY $0;
final = UNION f1,f2;
DUMP final;

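Alternatively, you can use a single filter file that contains multiple columns, join against it on a different column for each filter, and then union the resulting relations.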

large = load 'a super large file';
filter_file = load 'file with values in different columns';

f1 = JOIN large BY $19,filter_file BY $0;
f2 = JOIN large BY $20,filter_file BY $1;
final = UNION f1,f2;
DUMP final;
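
One thing to watch with the JOIN plus UNION approach: a row whose $19 and $20 both match appears once in f1 and once in f2, and each joined row also carries the filter relation's columns. If that matters, a semi-join built from COGROUP may be a better fit. The following is a minimal sketch, not a tested script. It assumes a single value list applies to both columns (as in the question), and the relation names wanted, g19, m19, r19, g20, m20, r20, matched and result are made up for illustration.

```pig
-- Sketch only: paths are placeholders, as in the examples above.
large  = LOAD 'a super large file';
wanted = LOAD 'file with the values to keep';   -- one value per line, in $0

-- Semi-join on $19: keep rows of large whose $19 occurs in wanted,
-- without appending any columns from wanted.
g19 = COGROUP large BY $19, wanted BY $0;
m19 = FILTER g19 BY NOT IsEmpty(wanted);
r19 = FOREACH m19 GENERATE FLATTEN(large);

-- Same pattern for $20.
g20 = COGROUP large BY $20, wanted BY $0;
m20 = FILTER g20 BY NOT IsEmpty(wanted);
r20 = FOREACH m20 GENERATE FLATTEN(large);

-- UNION keeps duplicates, so collapse rows that matched on both columns.
matched = UNION r19, r20;
result  = DISTINCT matched;
DUMP result;
```

Note that DISTINCT also collapses rows that were already exact duplicates in the large file, so skip it if those must be preserved. If you stay with the JOIN version and the filter file is small enough to fit in memory, appending USING 'replicated' to each JOIN (with the small relation listed last) is worth trying.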