Apache Pig JOIN not behaving as expected

Time: 2014-03-30 17:43:45

Tags: inner-join apache-pig

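I am new to Apache Pig. I created two files with tab-separated fields: employees.txt and employees2.txt. [There are no blank lines in the actual files; any extra spacing here is only to keep this editor happy.]

employees.txt contains: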

joe     21      94085   50000.0
Tom     21      94085   50000.0
John    21      94085   50000.0



employees2.txt contains:

joe     4085559898
joe     4085559899
tom     4085559897
tom     4085559896
john    4085559896



Then I tried a simple join:

e1 = LOAD 'employees.txt' AS (name, age, zip, salary);
e2 = LOAD 'employees2.txt' AS (name, phone);
e3 = JOIN e1 BY name, e2 BY name;
DUMP e3;



Result:

(joe,21,94085,50000.0,joe,4085559899)
(joe,21,94085,50000.0,joe,4085559898)



I expected:

(joe,21,94085,50000.0,joe,4085559899)
(joe,21,94085,50000.0,joe,4085559898)
(Tom,21,94085,50000.0,Tom,4085559897)
(Tom,21,94085,50000.0,Tom,4085559896)
(joe,21,94085,50000.0,Tom,4085559896)



What am I doing wrong?

Thanks,

Chris

1 Answer:

Answer 0: (score: 1)

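Like almost all computer languages, Pig compares strings case-sensitively. Hence "Joe" != "joe" and "Tom" != "tom". In your data, only joe is spelled identically in both files, so only the joe rows survive the join.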
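A quick way to see the case sensitivity for yourself is to filter on the name directly. This is only a sketch: the field types in the schema (chararray, int, double) are my assumptions, since the question loads the fields untyped, and the relation names exact and folded are made up for illustration.

```pig
-- Schema types are assumed; the question loads these fields without types.
e1 = LOAD 'employees.txt' AS (name:chararray, age:int, zip:chararray, salary:double);

-- Exact comparison: the file has 'Tom', not 'tom', so no row matches.
exact = FILTER e1 BY name == 'tom';

-- Comparing the lowercased value matches the 'Tom' row.
folded = FILTER e1 BY LOWER(name) == 'tom';

DUMP exact;
DUMP folded;
```

JOIN compares keys the same way the first filter does, which is why the Tom and John rows drop out.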

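You should change the names in employees.txt to lowercase. Then you should get the result you expect.

Alternatively, you can use Pig's built-in string function LOWER to convert the name field to all lowercase inside the script.

Something like: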

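e1 = LOAD 'employees.txt' AS (name:chararray, age, zip, salary);
e2 = LOAD 'employees2.txt' AS (name:chararray, phone);
-- Alias the lowercased field as "name" so the JOIN below can reference it
e1_lower = FOREACH e1 GENERATE LOWER(name) AS name, age, zip, salary;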
e3 = JOIN e1_lower BY name, e2 BY name;
DUMP e3;
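If you would rather not depend on either file's casing, or you want to keep the original spelling in the output, one option is to lowercase the key on both sides and join on that. The following is a sketch under a few assumptions: the column types are my guess (phone is read as chararray because values like 4085559898 do not fit in an int), and key, e1_keyed, e2_keyed, joined and result are hypothetical names introduced here.

```pig
-- Types are assumptions; the question loads the fields untyped.
e1 = LOAD 'employees.txt' AS (name:chararray, age:int, zip:chararray, salary:double);
e2 = LOAD 'employees2.txt' AS (name:chararray, phone:chararray);

-- Add a lowercased helper column "key" and keep the original name for output.
e1_keyed = FOREACH e1 GENERATE LOWER(name) AS key, name, age, zip, salary;
e2_keyed = FOREACH e2 GENERATE LOWER(name) AS key, phone;

-- Join on the normalized key so 'Tom' and 'tom' are treated as equal.
joined = JOIN e1_keyed BY key, e2_keyed BY key;

-- Drop the helper columns; names keep the spelling used in employees.txt.
result = FOREACH joined GENERATE
    e1_keyed::name   AS name,
    e1_keyed::age    AS age,
    e1_keyed::zip    AS zip,
    e1_keyed::salary AS salary,
    e2_keyed::phone  AS phone;

DUMP result;
```

With the sample files from the question this should give five rows (two for joe, two for Tom, one for John), in no guaranteed order.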