I have two tables:
Name_SSN and Phone_Address
Name_SSN contains:
Joe xxx-xx-xxxx
Jim xxx-xx-xxxx
Bob xxx-xx-xxxx
Phone_Address contains:
Joe 999-999-9990 Sunset Florida
Joe 999-999-9991 Sunset Florida
Joe 999-999-9992 Sunset Florida
Jim 999-999-9994 Sunny CA
Jim 999-999-9994 Sunny CA
Bob 999-999-9999 Raleigh VA
I want to join them and get:
Joe xxx-xx-xxxx Sunset Florida
Jim xxx-xx-xxxx Sunny CA
Bob xxx-xx-xxxx Raleigh VA
I'm new to Pig and am stuck...
Thanks for your help,
Chris
Answer 0 (score: 0)
It sounds like you want an inner join in Pig. The following script should help:
NameSSNAddr.pig
--Load the two data files (assumed here to be comma-separated, as PigStorage(',') expects)
namessn = LOAD 'Name_SSN.csv' USING PigStorage(',') AS (name:chararray, ssn:chararray);
phoneaddr = LOAD 'Phone_Address.csv' USING PigStorage(',') AS (name:chararray, phone:chararray, address:chararray);
--Perform the join of the two datasets on the "name" field
data_join = JOIN namessn BY name, phoneaddr BY name;
--The join combined all fields from both datasets.
--We just want a few fields, so generate them specifically.
data = FOREACH data_join GENERATE namessn::name AS name, namessn::ssn AS ssn, phoneaddr::address AS address;
--You didn't say if you wanted the data distinct or not.
--If you want only one row per distinct user, use this alias.
data_distinct = DISTINCT data;
--Dump all of the aliases so you can see what's in them.
dump namessn;
dump phoneaddr;
dump data;
dump data_distinct;
dump namessn
(Joe,xxx-xx-xxx1)
(Jim,xxx-xx-xxx2)
(Bob,xxx-xx-xxx3)
dump phoneaddr
(Joe,999-999-9990,Sunset Florida)
(Joe,999-999-9991,Sunset Florida)
(Joe,999-999-9992,Sunset Florida)
(Jim,999-999-9994,Sunny CA)
(Jim,999-999-9994,Sunny CA)
(Bob,999-999-9999,Raleigh VA)
dump data
(Bob,xxx-xx-xxx3,Raleigh VA)
(Jim,xxx-xx-xxx2,Sunny CA)
(Jim,xxx-xx-xxx2,Sunny CA)
(Joe,xxx-xx-xxx1,Sunset Florida)
(Joe,xxx-xx-xxx1,Sunset Florida)
(Joe,xxx-xx-xxx1,Sunset Florida)
dump data_distinct
(Bob,xxx-xx-xxx3,Raleigh VA)
(Jim,xxx-xx-xxx2,Sunny CA)
(Joe,xxx-xx-xxx1,Sunset Florida)
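Because the join above repeats each user once per phone number, a possible variation is to drop the phone column and de-duplicate the addresses before joining, so fewer rows go through the join. This is only a sketch: it assumes the same comma-separated input files as the script above, and the aliases addr, addr_distinct, joined and result are names made up for illustration.

```pig
-- Load the same two files as in NameSSNAddr.pig (assumed comma-separated).
namessn = LOAD 'Name_SSN.csv' USING PigStorage(',') AS (name:chararray, ssn:chararray);
phoneaddr = LOAD 'Phone_Address.csv' USING PigStorage(',') AS (name:chararray, phone:chararray, address:chararray);

-- Keep only the fields needed from Phone_Address, then remove duplicate rows.
addr = FOREACH phoneaddr GENERATE name, address;
addr_distinct = DISTINCT addr;

-- Inner join on name; each user now matches one row per distinct address.
joined = JOIN namessn BY name, addr_distinct BY name;

-- Project the three fields requested in the question.
result = FOREACH joined GENERATE namessn::name AS name, namessn::ssn AS ssn, addr_distinct::address AS address;

dump result;
```

With the sample data shown above, result should hold the same three rows as data_distinct, though the row order is not guaranteed. A user with two different addresses would still produce two rows.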