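I have two relations:
personCounts
(personName:chararray, count:int)
whitelist
(empID:int, empName:chararray)
I want the people who are in personCounts but not in whitelist. I know that JOIN returns the elements that appear in both. Is there a way to return the ones that would be discarded instead? I thought I could do this with CROSS, but I think I would end up with extra rows...?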
crossed = CROSS personCounts, whitelist;
filcrs = FILTER crossed BY NOT (personCounts::personName MATCHES whitelist::empName);
Answer 0 (score: 2)
I think what you are trying to compute is the set difference between personCounts and whitelist?
If so, try the following (untested!!!):
CGRP = COGROUP personCounts BY personName, whitelist BY empName;
PC_MINUS_WL = FILTER CGRP BY IsEmpty(whitelist);
PC_MINUS_WL = FOREACH PC_MINUS_WL GENERATE group AS name;
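The snippet above keeps only the group key, so the count column is lost. If the full personCounts rows are wanted, a minimal sketch (equally untested) would flatten the personCounts bag instead. The alias NOT_LISTED is a name made up for this illustration:

```pig
-- Group both relations by name; each group carries one bag per input relation.
CGRP = COGROUP personCounts BY personName, whitelist BY empName;
-- Keep only the names that have no matching whitelist entry.
NOT_LISTED = FILTER CGRP BY IsEmpty(whitelist);
-- Unnest the personCounts bag to recover (personName, count) rows.
PC_MINUS_WL = FOREACH NOT_LISTED GENERATE FLATTEN(personCounts);
```

Because IsEmpty(whitelist) only passes groups that came from personCounts, the flattened bag is never empty here.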
I found the following resource useful:
http://agiletesting.blogspot.de/2012/02/set-operations-in-apache-pig.html
Answer 1 (score: 2)
You can do this with a FULL outer JOIN.
joined = JOIN personCounts BY personName FULL, whitelist BY empName;
joined = FILTER joined BY NOT $0 MATCHES '';
joined = FILTER joined BY $3 IS null;
joined then holds tuples of the form (personName, count, , ), where the two whitelist fields are null.
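Since only rows coming from personCounts are of interest, a LEFT OUTER join should be enough and avoids having to discard the whitelist-only rows afterwards. A minimal sketch, untested; the input paths and the aliases notListed and result are assumptions made for illustration:

```pig
-- Hypothetical input paths; PigStorage defaults to tab-delimited text.
personCounts = LOAD 'personCounts.tsv' AS (personName:chararray, count:int);
whitelist    = LOAD 'whitelist.tsv'    AS (empID:int, empName:chararray);

-- LEFT OUTER keeps every personCounts row; unmatched rows get null whitelist fields.
joined = JOIN personCounts BY personName LEFT OUTER, whitelist BY empName;

-- A null empName means the person has no whitelist entry.
notListed = FILTER joined BY whitelist::empName IS NULL;

-- Project back to the original personCounts schema.
result = FOREACH notListed GENERATE personCounts::personName AS personName,
                                    personCounts::count AS count;
DUMP result;
```

Note that rows whose personName is null never match in a join, so they would also appear in result.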