Filtering out NULL values with Pig

Time: 2014-10-16 18:06:26

Tags: null row apache-pig

I have a table that I load as follows.

 A = load 'customer' using PigStorage('|');

The customer file contains rows like these:

7|Ron|ron@abc.com
8|Rina  
9|Don|dmes@xyz.com
9|Don|dmes@xyz.com
10|Maya|maya@cnn.com

11|marry|mary@abc.com

When I use the following ...

B = DISTINCT A;
A_CLEAN = FILTER B by ($0 is not null) AND ($1 is not null) AND ($2 is not null);

it removes the 8|Rina row as well.

How can I remove only the empty rows with Pig?

Is there some way I could try something like     A_CLEAN = FILTER B BY NOT IsNULL() ???

I am new to Pig, so I am not sure what I should put inside IsNULL ...

Thanks

A_CLEAN = FILTER B BY NOT IsEmpty(B);

2 answers:

Answer 0 (score: 2)

Try the following:

A = LOAD 'customer' USING PigStorage('|');
B = DISTINCT A;
A_CLEAN = FILTER B BY NOT(($0 IS NULL) AND ($1 IS NULL) AND ($2 IS NULL));
DUMP A_CLEAN;

This produces the output:

(8,Rina)
(7,Ron,ron@abc.com)
(9,Don,dmes@xyz.com)
(10,Maya,maya@cnn.com)
(11,marry,mary@abc.com)

In Pig, you cannot test a tuple for emptiness.
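The filter above keeps a row unless all three fields are null, which by De Morgan's law is the same as keeping a row when at least one field is non-null. A minimal sketch of that equivalence in plain Python (not Pig): `None` stands in for a Pig null, and the function names `not_all_null` and `any_not_null` are hypothetical, chosen only for this illustration.

```python
from itertools import product

def not_all_null(row):
    # Models: NOT(($0 IS NULL) AND ($1 IS NULL) AND ($2 IS NULL))
    return not (row[0] is None and row[1] is None and row[2] is None)

def any_not_null(row):
    # Models: ($0 IS NOT NULL) OR ($1 IS NOT NULL) OR ($2 IS NOT NULL)
    return row[0] is not None or row[1] is not None or row[2] is not None

# Every null / non-null combination for three fields ("x" is any non-null value).
combos = list(product([None, "x"], repeat=3))

# The two predicates agree on all eight combinations.
agree = all(not_all_null(r) == any_not_null(r) for r in combos)

# The only combination that gets dropped is the fully null row.
dropped = [r for r in combos if not not_all_null(r)]

print(agree)
print(dropped)
```

So a row such as 8|Rina, which has two non-null fields, passes this filter, while a row whose fields are all null does not.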

Answer 1 (score: 0)

 Tarun, why not use an OR condition instead of the AND condition?
        A_CLEAN = FILTER B BY ($0 IS NOT NULL) OR ($1 IS NOT NULL) OR ($2 IS NOT NULL);
 This removes the rows in which every column is null and keeps any row that has at least one non-null column.
 Can you try it and let me know whether it works for all of your cases?

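Update:
I don't know why IsEmpty() is not working for you; it works for me. IsEmpty only applies to bags, so I convert all the fields into a bag and test that bag for emptiness. See the working code below.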

input.txt
7|Ron|ron@abc.com
8|Rina
9|Don|dmes@xyz.com
9|Don|dmes@xyz.com
10|Maya|maya@cnn.com

11|marry|mary@abc.com

Pig script:
A = LOAD 'input.txt' USING PigStorage('|');
B = DISTINCT A;
A_CLEAN = FILTER B BY NOT IsEmpty(TOBAG($0..));
DUMP A_CLEAN;

Output:
(8,Rina  )
(7,Ron,ron@abc.com)
(9,Don,dmes@xyz.com)
(10,Maya,maya@cnn.com)
(11,marry,mary@abc.com)

As for your other question, it is simple Boolean logic:

In case of AND, 
8|Rina
 will be treated as
 ($0 is not null) AND ($1 is not null) AND ($2 is not null)
 (true) AND (true) AND (false)
 (false) -->so this record will be skipped by Filter command

In case of OR, 
8|Rina
 will be treated as
 ($0 is not null) OR ($1 is not null) OR ($2 is not null)
 (true) OR (true) OR (false)
 (true) -->so this record will be included into the relation by Filter command

In case of empty record, 
<empty record>
  will be treated as
  ($0 is not null) OR ($1 is not null) OR ($2 is not null)
  (false) OR (false) OR (false)
  (false) -->so this record will be skipped by Filter command
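The three cases above can be checked with a small plain-Python model (not Pig). It assumes that an empty field, or a field missing from a short row, reads as null, which is represented here by `None`. The helpers `parse`, `keep_and` and `keep_or` are hypothetical names used only for this sketch.

```python
# Sample lines from the question; the empty string models the blank row.
LINES = [
    "7|Ron|ron@abc.com",
    "8|Rina",
    "9|Don|dmes@xyz.com",
    "",
    "11|marry|mary@abc.com",
]

def parse(line, width=3):
    # Split on '|', treat empty fields as null, and pad short rows with nulls.
    fields = [f if f != "" else None for f in line.split("|")]
    return tuple(fields + [None] * (width - len(fields)))

def keep_and(row):
    # Models: ($0 is not null) AND ($1 is not null) AND ($2 is not null)
    return all(f is not None for f in row)

def keep_or(row):
    # Models: ($0 is not null) OR ($1 is not null) OR ($2 is not null)
    return any(f is not None for f in row)

rows = [parse(line) for line in LINES]
and_kept = [r for r in rows if keep_and(r)]
or_kept = [r for r in rows if keep_or(r)]

print(and_kept)  # the 8|Rina row and the blank row are both dropped
print(or_kept)   # only the blank row is dropped
```

With AND, a single null field is enough to drop a row; with OR, a row is dropped only when every field is null, which is the behaviour the question asks for.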