Pig Latin: removing a tuple from a data bag

Time: 2015-10-19 13:54:03

Tags: hadoop apache-pig

Here is the code that leads to my problem:

 a = LOAD 'tellers' using TextLoader() AS line;
 -- convert a to chararray
 b = foreach a generate (chararray)line;
 -- run through my UDF to create tuples
 c = foreach b generate myudfs.TellerParser5(line);  -- ({(20),(5),(5),(10),(1),(1),(1),(1),(1),(5),(10),(10),(10)})....
 d = foreach c generate flatten(number);
 e = group d by number; -- {group: chararray,d: {(number: chararray)}}
 f = foreach e generate group, COUNT(d);  -- f: {group: chararray,long}

In data bag f, there is an empty tuple (,1) that I want to filter out/remove.

 dump f;
 (,1)
 (1,97)
 (5,49)
 (10,87)
 (20,24)

 describe f;
 f: {group: chararray,long}

I tried this without success (nothing changed):

 remove_tuple = filter f BY group is not null; 

3 Answers:

Answer 0: (score: 0)

group is a Pig keyword. Hopefully this will work if some other name is used for the field.
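
For illustration, here is a minimal sketch of what renaming the key field could look like, continuing from relation e in the question. The aliases num and cnt are hypothetical names introduced here. Renaming changes only the field's name, not its value, so this alone may not remove the blank row if the key is an empty string rather than null.

```pig
-- Continues from relation e in the question (e = group d by number).
-- num and cnt are hypothetical aliases, not from the original post.
f = foreach e generate group as num, COUNT(d) as cnt;
remove_tuple = filter f by num is not null;
dump remove_tuple;
```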

Answer 1: (score: 0)

You can use != 'null' as the condition to filter out NULLs. I used the following as input.

(,1)
(1,97)
(5,49)
(10,87)
(20,24)

Here is how we can filter out the NULL.

A = LOAD 'file' using PigStorage(',') AS (a:chararray,b:long);
B = FILTER A BY a!='null';
DUMP B;

So, for your script, the line would be something like

 remove_tuple = filter f BY group!='null'; 

Output:

(1,97)
(5,49)
(10,87)
(20,24)
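
One caveat about the condition above: in Pig Latin, 'null' in quotes is an ordinary string literal, not the null value. A comparison against a null field evaluates to null, and FILTER passes only rows whose condition is true, so rows with a null key are dropped as a side effect. A key that is an empty string would pass that test. The sketch below states the intent explicitly; it assumes a plain comma-separated input file with rows such as `,1` and `1,97`, where PigStorage loads the empty first field as null.

```pig
-- Assumes 'file' holds plain comma-separated rows such as ",1" and "1,97".
A = LOAD 'file' USING PigStorage(',') AS (a:chararray, b:long);
-- Drop rows whose key is null or an empty string.
B = FILTER A BY a IS NOT NULL AND a != '';
DUMP B;
```

The `a != ''` check alone would also drop null keys, for the same reason; `IS NOT NULL` is kept only to make the intent readable.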

Answer 2: (score: 0)

I solved it by adding a step that casts the value to int. Here are the steps:

 e = foreach d generate (int)$0; -- this is the key added step

 f = group e by number; -- {group: int,e: {(number: int)}}
 g = foreach f generate group, COUNT(e);  -- g: {group: int,long}
 h = foreach f generate group, SUM(e);
 i = filter g by $0 is not null;
 dump i;
 (1,97)
 (5,49)
 (10,87)
 (20,24)
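
A likely reason the cast helps, though the post does not confirm it: the blank key was an empty chararray rather than null. A chararray that cannot be parsed as an int casts to null, so `is not null` then matches it. If that assumption holds, a sketch of an alternative is to filter the blank values out before grouping, which keeps the keys as chararray. The relation name d_clean is hypothetical.

```pig
-- Continues from relation d in the question (one chararray field named number).
-- d_clean is a hypothetical name; this assumes the blank key is '' or null.
d_clean = filter d by number is not null and number != '';
e = group d_clean by number;
f = foreach e generate group, COUNT(d_clean);
dump f;
```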