In Apache Pig

Time: 2017-10-24 13:39:27

Tags: join apache-pig dump

I have two data objects (relations) in Pig.

data_1:

col_a: chararray,
col_b: int,
col_c: int,
col_d: chararray

data_2:

col_a: chararray,
col_b: chararray,
col_c: int,
col_d: int,
col_e: int

I want to join the two of them, and I have tried:

all_data = JOIN data_1 BY (col_a) LEFT, data_2 by (col_b);
all_data = JOIN data_1 BY (col_a), data_2 by (col_b);

When I try to DUMP the joined object (after limiting it to 10 records), both options give the same error:

Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: all_data_limit: Limit - scope-6383 Operator Key: scope-6383): org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: all_data: New For Each(true,true)[tuple] - scope-6382 Operator Key: scope-6382): org.apache.pig.backend.executionengine.ExecException: ERROR 0: java.lang.ClassCastException: org.apache.pig.impl.io.NullableText cannot be cast to org.apache.pig.impl.io.NullableBytesWritable
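  • DESCRIBE on both objects (data_1, data_2) gives the expected output (the schemas I wrote at the top).
  • DESCRIBE on the joined object, all_data, also returns the output it should.
  • I printed a LIMIT 10 of both objects, and they contain good data.
  • I am using the Amazon cluster "emr-5.2.0", with Pig version 0.16.0.

I am a bit frustrated that I cannot find a solution; I have been looking for one for 3 days now... Any help would be great. Thanks!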
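The checks listed above can be sketched as the following sequence. This is only an illustration: the original script is not shown, so the LOAD statements, the file paths and the PigStorage loader are hypothetical placeholders, and a script that declares the types at load time like this may well not reproduce the error. The alias all_data_limit is taken from the error message.

```pig
-- Hypothetical loaders and paths; how data_1 and data_2 were really built is not shown.
data_1 = LOAD 'data_1.tsv' USING PigStorage('\t')
         AS (col_a:chararray, col_b:int, col_c:int, col_d:chararray);
data_2 = LOAD 'data_2.tsv' USING PigStorage('\t')
         AS (col_a:chararray, col_b:chararray, col_c:int, col_d:int, col_e:int);

-- Inspect the schemas of the inputs.
DESCRIBE data_1;
DESCRIBE data_2;

-- Left outer join on data_1.col_a = data_2.col_b, as in the first attempt.
all_data = JOIN data_1 BY (col_a) LEFT, data_2 BY (col_b);
DESCRIBE all_data;

-- Limit to 10 records and dump; this is the step that fails in the question.
all_data_limit = LIMIT all_data 10;
DUMP all_data_limit;
```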

1 answer:

Answer 0: (score: 1)

Use the following commands:

all_data = JOIN data_1 BY TRIM(col_a) LEFT, data_2 by TRIM(col_b);
all_data = JOIN data_1 BY TRIM(col_a), data_2 by TRIM(col_b);

Let me know whether it runs correctly.
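One possible reading of why this could help, offered as an assumption rather than a confirmed diagnosis: the exception says a NullableText key could not be cast to NullableBytesWritable, which suggests the two join keys reach the join with different runtime types even though DESCRIBE reports chararray for both. The built-in TRIM takes a chararray and returns a chararray, so wrapping both keys in it would give them the same type. A minimal sketch for checking the suggestion, assuming data_1 and data_2 are already defined as in the question (the alias all_data_limit comes from the error message):

```pig
-- Inner-join variant from the answer, with both keys wrapped in the built-in TRIM.
all_data = JOIN data_1 BY TRIM(col_a), data_2 BY TRIM(col_b);

-- Repeat the step that failed in the question: limit to 10 records and dump.
all_data_limit = LIMIT all_data 10;
DUMP all_data_limit;
```

Note that TRIM also strips leading and trailing whitespace from the keys, so rows whose keys differ only by such whitespace will now match.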