Join with distinct rows in Pig

Time: 2014-03-26 21:11:49

Tags: hadoop apache-pig

I have two tables:

Name_SSN and Phone_Address

Name_SSN contains:
Joe xxx-xx-xxxx
Jim xxx-xx-xxxx
Bob xxx-xx-xxxx

Phone_Address contains:
Joe 999-999-9990 Sunset Florida
Joe 999-999-9991 Sunset Florida
Joe 999-999-9992 Sunset Florida
Jim 999-999-9994 Sunny CA
Jim 999-999-9994 Sunny CA
Bob 999-999-9999 Raleigh VA

I want to join them and get:
Joe xxx-xx-xxxx Sunset Florida
Jim xxx-xx-xxxx Sunny CA
Bob xxx-xx-xxxx Raleigh VA

I am new to Pig and I am stuck...

Thanks for your help,

Chris

1 answer:

Answer 0 (score: 0)

It sounds like you want an inner join in Pig. The following script should help:

NameSSNAddr.pig

--Load the two data files
namessn = LOAD 'Name_SSN.csv' USING PigStorage(',') AS (name:chararray, ssn:chararray);
phoneaddr = LOAD 'Phone_Address.csv' USING PigStorage(',') AS (name:chararray, phone:chararray, address:chararray);

--Perform the join of the two datasets on the "name" field
data_join = JOIN namessn BY name, phoneaddr BY name;

--The join combined all fields from both datasets.  
--We just want a few fields, so generate them specifically.
data = FOREACH data_join GENERATE namessn::name AS name, namessn::ssn AS ssn, phoneaddr::address AS address;

--You didn't say if you wanted the data distinct or not.
--If you want only one row per distinct user, use this alias.
data_distinct = DISTINCT data;

--Dump all of the aliases so you can see what's in them.
dump namessn;
dump phoneaddr;

dump data;
dump data_distinct;



Output of dump namessn:
(Joe,xxx-xx-xxx1)
(Jim,xxx-xx-xxx2)
(Bob,xxx-xx-xxx3)



Output of dump phoneaddr:
(Joe,999-999-9990,Sunset Florida)
(Joe,999-999-9991,Sunset Florida)
(Joe,999-999-9992,Sunset Florida)
(Jim,999-999-9994,Sunny CA)
(Jim,999-999-9994,Sunny CA)
(Bob,999-999-9999,Raleigh VA)



Output of dump data:
(Bob,xxx-xx-xxx3,Raleigh VA)
(Jim,xxx-xx-xxx2,Sunny CA)
(Jim,xxx-xx-xxx2,Sunny CA)
(Joe,xxx-xx-xxx1,Sunset Florida)
(Joe,xxx-xx-xxx1,Sunset Florida)
(Joe,xxx-xx-xxx1,Sunset Florida)



Output of dump data_distinct:
(Bob,xxx-xx-xxx3,Raleigh VA)
(Jim,xxx-xx-xxx2,Sunny CA)
(Joe,xxx-xx-xxx1,Sunset Florida)
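
One possible variation, offered as a sketch rather than a tested replacement: because the phone column is discarded anyway, the duplicates can be removed from Phone_Address before the join, so the join has fewer rows to process. The aliases addr, addr_distinct, joined and result are names chosen here for illustration, and the file names and schemas are the same assumed ones as in the script above.

```pig
-- Sketch only: assumes the same comma-separated input files as NameSSNAddr.pig.
namessn = LOAD 'Name_SSN.csv' USING PigStorage(',') AS (name:chararray, ssn:chararray);
phoneaddr = LOAD 'Phone_Address.csv' USING PigStorage(',') AS (name:chararray, phone:chararray, address:chararray);

-- Drop the phone column, then remove duplicate (name, address) pairs.
addr = FOREACH phoneaddr GENERATE name, address;
addr_distinct = DISTINCT addr;

-- Inner join on name; each name now matches one row per distinct address.
joined = JOIN namessn BY name, addr_distinct BY name;
result = FOREACH joined GENERATE namessn::name AS name, namessn::ssn AS ssn, addr_distinct::address AS address;

DUMP result;
```

Like the original script, this still returns more than one row for a name that appears with several different addresses; it only collapses rows whose name and address are identical.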