NOT IN clause in Pig

Time: 2017-02-02 09:43:11

Tags: hadoop mapreduce apache-pig

I am trying to write the Pig equivalent of this SQL query:

select * from A where A.ID NOT IN (select id from B)

Here is my attempt:

sourcenew = LOAD 'hdfs://HADOOPMASTER:54310/DVTTest/Source.txt' USING PigStorage(',') as (ID:int,Name:chararray,FirstName:chararray ,LastName:chararray,Vertical_Name:chararray ,Vertical_ID:chararray,Gender:chararray,DOB:chararray,Degree_Percentage:chararray ,Salary:chararray,StateName:chararray);
destnew = LOAD 'hdfs://HADOOPMASTER:54310/DVTTest/Destination.txt' USING PigStorage(',') as (ID:int,Name:chararray,FirstName:chararray ,LastName:chararray,Vertical_Name:chararray ,Vertical_ID:chararray,Gender:chararray,DOB:chararray,Degree_Percentage:chararray ,Salary:chararray,StateName:chararray);
c= FOREACH destnew GENERATE ID;
D=FILTER sourcenew BY NOT ID (c.ID);
 org.apache.pig.tools.pigscript.parser.ParseException: Encountered " <PATH> "D=FILTER "" at line 1, column 1.
Was expecting one of:
<EOF> 
"cat" ...
"clear" ...<EOF>

I get this error when executing the last line. Any help resolving it would be appreciated.

1 answer:

Answer 0 (score: 1)

Use a LEFT OUTER JOIN and keep only the rows where the right-hand side is null.

sourcenew = LOAD 'hdfs://HADOOPMASTER:54310/DVTTest/Source.txt' USING PigStorage(',') as (ID:int,Name:chararray,FirstName:chararray ,LastName:chararray,Vertical_Name:chararray ,Vertical_ID:chararray,Gender:chararray,DOB:chararray,Degree_Percentage:chararray ,Salary:chararray,StateName:chararray);
destnew = LOAD 'hdfs://HADOOPMASTER:54310/DVTTest/Destination.txt' USING PigStorage(',') as (ID:int,Name:chararray,FirstName:chararray ,LastName:chararray,Vertical_Name:chararray ,Vertical_ID:chararray,Gender:chararray,DOB:chararray,Degree_Percentage:chararray ,Salary:chararray,StateName:chararray);
c = FOREACH destnew GENERATE ID;
d = JOIN sourcenew BY ID LEFT OUTER,destnew by ID;
-- After a JOIN, fields that share a name are disambiguated with '::' (not '.').
e = FILTER d BY destnew::ID IS NULL;

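Note: I wrote a sample script with a couple of test files, and the working solution is below. In your case, check that the data is being loaded from the files correctly.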

test1.txt

1   abc
2   def
3   ghi
4   jkl
5   mno
6   pqr
7   stu
8   vwx
1   abc
2   def
3   ghi
4   jkl
1   abc
2   def
3   ghi
1   abc
2   def

test2.txt

1
2
3
4

Script

A = LOAD 'test1.txt' USING PigStorage('\t') AS (aid:int,name:chararray);
B = LOAD 'test2.txt' USING PigStorage('\t') AS (bid:int);
C = JOIN A BY aid LEFT OUTER,B BY bid;
D = FILTER C BY bid is null;
DUMP D;

In the example above, the records with IDs 5, 6, 7, and 8 should appear in the result, because those IDs are not in test2.txt.
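The same "not in" result can also be obtained without a join. The following is a minimal sketch of a different technique, COGROUP combined with the built-in IsEmpty function, and is not part of the original answer. It assumes the same tab-separated test1.txt and test2.txt files; the aliases G, M, and R are names chosen here for illustration.

```pig
-- Alternative sketch (assumption: same test files as above, tab-delimited).
A = LOAD 'test1.txt' USING PigStorage('\t') AS (aid:int, name:chararray);
B = LOAD 'test2.txt' USING PigStorage('\t') AS (bid:int);

-- Group both relations by ID; each tuple is (group, {tuples from A}, {tuples from B}).
G = COGROUP A BY aid, B BY bid;

-- Keep only the groups whose bag of B tuples is empty, i.e. IDs with no match in B.
M = FILTER G BY IsEmpty(B);

-- Unnest the surviving A tuples back into plain (aid, name) rows.
R = FOREACH M GENERATE FLATTEN(A);
DUMP R;
```

Unlike the LEFT OUTER JOIN version, R here contains only the columns of A, so there is no trailing null bid field to project away.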
