I have a Pig script like this:
a = load 'large_file' using PigStorage(',');
b = filter a by $16 == '12345678';
c = filter a by $16 == '456';
d = union b, c;
store d into 'output.csv';
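What if I want to filter a by a list of values? For example, I want to extract every row whose column 16 holds a value that appears in a large list of values.
In pandas terms it would be: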
df[df['col'].isin([one massive list])]
I am using Pig version 0.8.
Answer 0: (score: 1)
For Pig 0.8, chain multiple OR conditions in a single FILTER:
b = filter a by $16 == '12345678' OR $16 == '456' OR $16 == 'anotherval';
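If the list is long, typing the OR chain by hand gets tedious. One possible workaround is to generate the statement from the list and paste it into the script. The sketch below is in Python, since the question already mentions pandas. The helper name build_or_filter and its parameters are made up for illustration, and the quoting assumes every value is compared as a chararray constant in single quotes.

```python
def build_or_filter(values, column="$16", src="a", dst="b"):
    """Return a Pig FILTER statement that ORs one equality test per value."""
    if not values:
        raise ValueError("need at least one value to filter on")

    def quote(value):
        # Assumption: backslash-escape backslashes and single quotes so the
        # value stays a valid single-quoted Pig string constant.
        text = str(value).replace("\\", "\\\\").replace("'", "\\'")
        return "'" + text + "'"

    conditions = " OR ".join(
        "{} == {}".format(column, quote(v)) for v in values
    )
    return "{} = filter {} by {};".format(dst, src, conditions)


if __name__ == "__main__":
    print(build_or_filter(["12345678", "456", "anotherval"]))
```

For the three values in this answer, the helper produces the same statement as the line above.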
Starting with Pig 0.12.0, you can use the IN operator:
b = filter a by $16 IN ('12345678', '456', ... );