NOT operator does not create 2 mutually exclusive groups

Time: 2015-03-12 01:57:26

Tags: apache-pig

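In my script I read from multiple files and use a regular expression and its complement to partition the records into two groups/classes. I expected two mutually exclusive classes, but when I count the records the totals do not add up. So I added a SPLIT section to find the "rest": the records covered by neither my constraint nor its complement. The result is (again) not what I expected. What is wrong with my script? Thanks for your help!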

The expected 'math':

 input: 1464 records
 outputs: 264 + 870 + ???_330__?? 

Script:

A = load 'input/*' using PigStorage('\t','-tagPath') as (src:chararray, content:chararray);
Ac = foreach (GROUP A all) generate COUNT(A);

B = filter A by content MATCHES '(^\\b[BCDFMSTX].*\\b\\:\\s{1}.*)';
Bc = foreach (GROUP B all) generate COUNT(B);

Bnot = filter A by NOT content MATCHES '(^\\b[BCDFMSTX].*\\b\\:\\s{1}.*)';
Bcnot = foreach (GROUP Bnot all) generate COUNT(Bnot);

SPLIT A INTO SET1 IF (content MATCHES '(^\\b[BCDFMSTX].*\\b\\:\\s{1}.*)')
              , SET2 IF (NOT content MATCHES '(^\\b[BCDFMSTX].*\\b\\:\\s{1}.*)')
              , SETn OTHERWISE;

STORE SET1 into 'output/set1';
STORE SET2 into 'output/set2';
STORE SETn into 'output/setn';

Result:

 Input(s):
 Successfully read 1464 records (49024 bytes) from: "hdfs://localhost:9000/user/dag/input/*"

 Output(s):
 Successfully stored 264 records (25276 bytes) in: "hdfs://localhost:9000/user/dag/output/set1"
 Successfully stored 870 records (84190 bytes) in: "hdfs://localhost:9000/user/dag/output/set2"
 Successfully stored 0 records in: "hdfs://localhost:9000/user/dag/output/setn"

1 answer:

Answer 0: (score: 0)

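I think content is null in the 330 missing cases. In Pig, MATCHES against a null value evaluates to null, and NOT of null is still null, so those records pass neither the filter nor its negation. If you replace the boolean expression with content is null OR NOT content MATCHES '(^\\b[BCDFMSTX].*\\b\\:\\s{1}.*)', it should work.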

That said, I do not find this very intuitive; I think Pig should throw a NullPointerException or at least log a warning.
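The null handling suggested in this answer could be sketched roughly as follows. This is only an illustration under the assumption that the 330 missing records really do have a null content field, which the question does not confirm. The relation names Anull, Anullc and SETnull and the path 'output/setnull' are made up for this example. The first part counts the null records to check the assumption; the second part routes them into their own set instead of relying on OTHERWISE, which stored 0 records in the question's run.

```pig
-- Load as in the question: src is the file path added by -tagPath.
A = LOAD 'input/*' USING PigStorage('\t', '-tagPath')
    AS (src:chararray, content:chararray);

-- Check the assumption: how many records have a null content field?
-- COUNT_STAR is used so that tuples are counted regardless of null fields.
Anull  = FILTER A BY content IS NULL;
Anullc = FOREACH (GROUP Anull ALL) GENERATE COUNT_STAR(Anull);
DUMP Anullc;

-- Three explicit, non-overlapping conditions. A null content makes
-- MATCHES (and NOT ... MATCHES) evaluate to null, so null records
-- are given their own branch (SETnull is a hypothetical name).
SPLIT A INTO
    SET1    IF (content MATCHES '(^\\b[BCDFMSTX].*\\b\\:\\s{1}.*)'),
    SET2    IF (content IS NOT NULL
                AND NOT (content MATCHES '(^\\b[BCDFMSTX].*\\b\\:\\s{1}.*)')),
    SETnull IF (content IS NULL);

STORE SET1    INTO 'output/set1';
STORE SET2    INTO 'output/set2';
STORE SETnull INTO 'output/setnull';
```

If the assumption holds, the three stored record counts should add up to the 1464 input records. If the null records should simply count as non-matching, the SET2 condition can instead be written as content IS NULL OR NOT content MATCHES '...', as proposed above, and the SETnull branch dropped.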