Join two datasets in Pig based on a string-matching condition

Date: 2016-02-08 11:52:23

Tags: join apache-pig

I'm new to Pig, and I have two datasets, "Highspender" and "Feedback".

Highspender:

Price,fname,lname
$50,Jack,Brown
$30,Rovin,Pall

Feedback:

date,Name,rate
2015-01-02,Jack B Brown,5
2015-01-02,Pall,4

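Now I need to join these two datasets on the name. The condition is that either fname or lname from Highspender should match Name from Feedback. How can I join these two datasets? Any ideas?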

1 Answer:

Answer 0: (score: 0)

You can try the script below; just replace the names to match your data.

highs = LOAD 'highs' using PigStorage(',') as (Price:chararray,fname:chararray,lname:chararray);
feedback = LOAD 'feeds' using PigStorage(',') as (date:chararray,Name:chararray,rate:chararray);
out = JOIN highs BY fname, feedback BY Name;
out1 = JOIN highs BY lname, feedback BY Name;
final_out = UNION out,out1;

For further help, see the Pig Reference Manual.

Edit

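Following the comment, here is a script that joins the data using string functions. It takes the cross product of the two relations and keeps each pair where removing lname (or fname) from Name changes the string, that is, where Name contains it: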

highs = LOAD 'highs' using PigStorage(',') as (Price:chararray,fname:chararray,lname:chararray);
feedback = LOAD 'feeds' using PigStorage(',') as (date:chararray,Name:chararray,rate:chararray);
crossout = cross highs, feedback;
final_lname = filter crossout by ( REPLACE (feedback::Name,highs::lname ,'') != feedback::Name);
final_fname = filter crossout by ( REPLACE (feedback::Name,highs::fname ,'') != feedback::Name);
final = UNION final_lname, final_fname;
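
One thing to watch with the script above: Pig's UNION does not remove duplicates, so a pair that passes both filters (for example "Jack B Brown", which contains both "Jack" and "Brown") appears twice in the result. A possible variant, offered as an untested sketch, combines the two conditions with OR in a single FILTER so each pair is kept at most once. The alias `matched` is just an illustrative name; the paths and schemas are the same as above.

```pig
-- Same inputs as in the scripts above.
highs = LOAD 'highs' USING PigStorage(',') AS (Price:chararray, fname:chararray, lname:chararray);
feedback = LOAD 'feeds' USING PigStorage(',') AS (date:chararray, Name:chararray, rate:chararray);

-- Pair every high spender with every feedback row.
crossout = CROSS highs, feedback;

-- Keep a pair when Name contains lname or fname; each pair passes at most once.
matched = FILTER crossout BY (REPLACE(feedback::Name, highs::lname, '') != feedback::Name)
                          OR (REPLACE(feedback::Name, highs::fname, '') != feedback::Name);
```

Note that REPLACE treats its second argument as a regular expression, so names containing regex metacharacters may not match as expected, and that CROSS can be expensive on large inputs.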