Pig - Replicated Join

Time: 2014-01-22 23:23:18

Tags: hadoop apache-pig

I have two input files.

Students file:

abc 30 4.5
xyz 34 9.5
def 28 6.5
klm 35 10.5

Location file:

abc hawthorne
xyz artesia
def garnet
klm vanness

The output I want:

abc hawthorne
xyz artesia
def garnet
klm vanness 

To achieve this, I wrote the following Pig script.

A = LOAD '/user/hive/warehouse/students.txt' USING PigStorage(' ') AS (NAME:CHARARRAY,AGE:INT,GPA:FLOAT);
B = LOAD '/user/hive/warehouse/location.txt.txt' using PigStorage(' ') AS (NAME:CHARARRAY,LOCATION:CHARARRAY);
C = JOIN A BY NAME , B BY LOCATION USING 'replicated';
DUMP C;

The problem is that I don't see any output. On top of that, I see the following warnings during execution:

2014-01-22 15:18:15,829 [main] WARN  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher  - Encountered Warning ACCESSING_NON_EXISTENT_FIELD 2 time(s).
2014-01-22 15:18:15,829 [main] WARN  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher  - Encountered Warning ACCESSING_NON_EXISTENT_FIELD 2 time(s).
2014-01-22 15:18:15,829 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher  - Success!
2014-01-22 15:18:15,829 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher  - Success!
2014-01-22 15:18:15,832 [main] INFO  org.apache.pig.data.SchemaTupleBackend  - Key [pig.schematuple] was not set... will not generate code.
2014-01-22 15:18:15,832 [main] INFO  org.apache.pig.data.SchemaTupleBackend  - Key [pig.schematuple] was not set... will not generate code.
2014-01-22 15:18:15,841 [main] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat  - Total input paths to process : 1
2014-01-22 15:18:15,841 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil  - Total input paths to process : 1
2014-01-22 15:18:15,841 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil  - Total input paths to process : 1
Hadoop Job IDs executed by Pig: job_201401210934_0082,job_201401210934_0083

1 Answer:

Answer 0 (score: 2):

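I think you are not seeing any output because the join does not produce any matches. You are joining on NAME from A (abc, xyz, def, klm) and on LOCATION from B (hawthorne, artesia, garnet, vanness). The two key sets have no strings in common, so the join returns no rows.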
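A minimal sketch of the corrected script, assuming the intent is to match both relations on NAME. The paths and schemas are copied from the question; the relation D and the final projection are my additions to get only the name and location columns, and I have not run this against the original data.

```pig
-- Load both files exactly as in the question.
A = LOAD '/user/hive/warehouse/students.txt' USING PigStorage(' ') AS (NAME:CHARARRAY, AGE:INT, GPA:FLOAT);
B = LOAD '/user/hive/warehouse/location.txt.txt' USING PigStorage(' ') AS (NAME:CHARARRAY, LOCATION:CHARARRAY);

-- Join on NAME in both relations so the keys can actually match.
-- With USING 'replicated', the relation listed last (B) is the one held in memory.
C = JOIN A BY NAME, B BY NAME USING 'replicated';

-- Assumed projection: keep only the name and the location.
D = FOREACH C GENERATE A::NAME, B::LOCATION;
DUMP D;
```

The `A::NAME` and `B::LOCATION` prefixes disambiguate the fields after the join, since both relations carry a NAME column.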