Pig on Hadoop: can't get JOIN to work

Asked: 2018-07-30 19:20:27

Tags: hadoop join mapreduce apache-pig

Here is what I am working with so far:

db2010 = LOAD '/project/BP_2010_00A1.csv' USING PigStorage(',');
db2014 = LOAD '/project/BP_2014_00A1.csv' USING PigStorage(',');

Dumping the first data set to give you a look:

first10 = limit db2010 10;
dump first10;

This gives the following:

(0500000US01001,01001,"Autauga County, Alabama",,00,Total for all sectors,,2010,871,10167,63783)
(0500000US01001,01001,"Autauga County, Alabama",,11,"Agriculture, forestry, fishing and hunting",,2010,6)
(0500000US01001,01001,"Autauga County, Alabama",,21,"Mining, quarrying, and oil and gas extraction",,2010,5)
(0500000US01001,01001,"Autauga County, Alabama",,22,Utilities,,2010,9,187,3667)
(0500000US01001,01001,"Autauga County, Alabama",,23,Construction,,2010,86,486,3401)
(0500000US01001,01001,"Autauga County, Alabama",,42,Wholesale trade,,2010,29,,1294)
(0500000US01001,01001,"Autauga County, Alabama",,51,Information,,2010,10,89,791)
(0500000US01001,01001,"Autauga County, Alabama",,52,Finance and insurance,,2010,69,322,2779)
(0500000US01001,01001,"Autauga County, Alabama",,53,Real estate and rental and leasing,,2010,37,125,672)

Then, dumping the next file, which should be in the same format (same data, different year):

(GEO.id,GEO.id2,GEO.display-label,GEO.annotation.id,NAICS.id,NAICS.display-label,NAICS.annotation.id,YEAR.id,ESTAB,EMP,PAYQTR1,PAYANN)
(0500000US01001,01001,"Autauga County, Alabama",,00,Total for all sectors,,2014,817,10202,71561)
(0500000US01001,01001,"Autauga County, Alabama",,11,"Agriculture, forestry, fishing and hunting",,2014,6)
(0500000US01001,01001,"Autauga County, Alabama",,21,"Mining, quarrying, and oil and gas extraction",,2014,3)
(0500000US01001,01001,"Autauga County, Alabama",,22,Utilities,,2014,9,177,3596)
(0500000US01001,01001,"Autauga County, Alabama",,23,Construction,,2014,70,349,2819)
(0500000US01001,01001,"Autauga County, Alabama",,42,Wholesale trade,,2014,29,,)
(0500000US01001,01001,"Autauga County, Alabama",,31-33,Manufacturing,,2014,21,,)
(0500000US01001,01001,"Autauga County, Alabama",,44-45,Retail trade,,2014,165,2525,14304)
(0500000US01001,01001,"Autauga County, Alabama",,48-49,Transportation and warehousing,,2014,17,137,1161)

Now I try to join on columns $1 and $4. For example, the ($1, $4) key for the last entry in the second dump above would be (01001,48-49):

dbboth = JOIN db2010 BY ($1, $4), db2014 BY ($1, $4);

But here is the problem: when I dump dbboth (terrible name, I know), I only get a single record:

(GEO.id,GEO.id2,GEO.display-label,GEO.annotation.id,NAICS.id,NAICS.display-label,NAICS.annotation.id,YEAR.id,ESTAB,EMP,PAYQTR1,PAYANN,GEO.id,GEO.id2,GEO.display-label,GEO.annotation.id,NAICS.id,NAICS.display-label,NAICS.annotation.id,YEAR.id,ESTAB,EMP,PAYQTR1,PAYANN)

That is the first line (the header row) of each of the two files.
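A quick way to see what the join actually compares is to project only the key columns before joining. DUMP prints fields separated by commas, so in the output above a comma inside a quoted value cannot be told apart from a field boundary. The sketch below repeats the two LOAD statements from earlier so it can run on its own; the relation names keys2010, keys2014, peek2010 and peek2014 are made up for illustration.

```pig
-- Same loaders as in the question.
db2010 = LOAD '/project/BP_2010_00A1.csv' USING PigStorage(',');
db2014 = LOAD '/project/BP_2014_00A1.csv' USING PigStorage(',');

-- Keep only the two positions used as the join key.
keys2010 = FOREACH db2010 GENERATE $1, $4;
keys2014 = FOREACH db2014 GENERATE $1, $4;

-- Look at a handful of keys from each side.
peek2010 = LIMIT keys2010 10;
peek2014 = LIMIT keys2014 10;
DUMP peek2010;
DUMP peek2014;
```

If the second value comes back empty for the data rows, the join key is not the NAICS code it appears to be in the full dumps.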

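Just looking at the 10 records dumped from each file, I see plenty that look like they should match on that join. Can anyone help me figure out why this isn't working the way I expect?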
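One possible explanation, offered as a guess rather than a confirmed diagnosis: PigStorage(',') splits on every comma and does not treat double quotes specially, so "Autauga County, Alabama" would occupy two positions and $4 would land on the empty GEO.annotation.id column instead of NAICS.id. Pig drops null keys in an inner join, which would leave only the header line (it contains no quoted commas) to match. Below is a minimal sketch of a quote-aware load using CSVExcelStorage from piggybank. It assumes the piggybank jar is available, and the jar path shown is a placeholder.

```pig
-- Placeholder path: point this at wherever piggybank.jar lives on your system.
REGISTER '/path/to/piggybank.jar';

-- CSVExcelStorage honors double-quoted fields; SKIP_INPUT_HEADER drops the header row.
db2010 = LOAD '/project/BP_2010_00A1.csv'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER');
db2014 = LOAD '/project/BP_2014_00A1.csv'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER');

-- With quoted commas kept inside one field, $1 is GEO.id2 and $4 is NAICS.id.
dbboth = JOIN db2010 BY ($1, $4), db2014 BY ($1, $4);
```

Under this assumption the join statement itself stays unchanged and only the loader differs.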

Thanks a lot.

0 answers:

No answers yet