I am feeding the Avro output of one Hadoop job into a second Hadoop job. The first job runs mappers only, with the settings below. In case it is useful, my avsc file defines composite objects like this:
[
{
"type": "record",
"name": "MySubRecord",
"namespace": "blah",
"fields": [
{"name": "foobar", "type": ["null","string"], "default":null},
{"name": "bar","type": ["null","string"], "default":null},
{"name": "foo","type": ["null","string"], "default":null},
]
},{
"type": "record",
"name": "MyRecord",
"namespace" : "blah",
"fields" : [
{"name": "ID", "type":["null", "string"], "default":null},
{"name": "secondID", "type":["null", "string"], "default":null},
{"name": "subRecordA", "type":["null","blah.MySubRecord"], "default":null},
{"name": "subRecordB", "type":["null","blah.MySubRecord"], "default":null},
{"name": "subRecordC", "type":["null","blah.MySubRecord"], "default":null},
{"name": "subRecordD", "type":["null","blah.MySubRecord"], "default":null},
{"name": "subRecordE", "type":["null","blah.MySubRecord"], "default":null},
{"name": "subRecordF", "type":["null","blah.MySubRecord"], "default":null},
{"name": "subRecordG", "type":["null","blah.MySubRecord"], "default":null},
{"name": "subRecordH", "type":["null","blah.MySubRecord"], "default":null}
]
}
]
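(For anyone reproducing this: MyRecord and MySubRecord are the specific Java classes Avro generates from the schema above, presumably via the avro-maven-plugin or a one-off avro-tools command along these lines, with the jar, schema file, and output directory as placeholders:)

java -jar avro-tools.jar compile schema blah.avsc generated-sources/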
My mapper class signature looks like this:
public static class MyMapper extends Mapper<LongWritable, Text, AvroKey<MyRecord>, NullWritable>
with a setup method like this:
private AvroKey<MyRecord> keyOut; // mapper field holding the reusable Avro key wrapper

@Override
protected void setup(Context context) throws IOException, InterruptedException {
    super.setup(context);
    keyOut = new AvroKey<>();
}
and a map method that looks like this:
protected void map(LongWritable keyIn, Text valueIn, Context context) throws IOException, InterruptedException {
    MyRecord record = getMyRecordFunction();
    keyOut.datum(record);
    context.write(keyOut, NullWritable.get());
}
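For completeness, a map-only job like this one is normally wired up in the driver with the Avro output key schema registered through AvroJob. A minimal sketch, assuming the driver class name and the input/output paths are placeholders (not from the original post):

import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyOutputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FirstJobDriver { // hypothetical driver class
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "produce-avro");
        job.setJarByClass(FirstJobDriver.class);
        job.setMapperClass(MyMapper.class);
        job.setNumReduceTasks(0); // map-only: mapper output is the job output
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Register the AvroKey schema so AvroKeyOutputFormat writes
        // MyRecord data into .avro container files.
        AvroJob.setOutputKeySchema(job, MyRecord.getClassSchema());
        job.setOutputFormatClass(AvroKeyOutputFormat.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}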
The logic in my first job looks fine, because when I print its output to JSON with the command-line avro-tools jar, it looks exactly as I expect.
My problem arises when I run the second job. Its mapper has the following signature:
public static class MySecondJobMapper extends Mapper<AvroKey<MyRecord>, NullWritable, IntWritable, DoubleWritable>
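On the driver side, the second job would presumably declare its Avro input along these lines (a sketch using the avro-mapred "mapreduce" API; job here is the second job's Job instance, configured analogously to the driver above):

// Read the first job's .avro output as AvroKey<MyRecord> input.
job.setInputFormatClass(AvroKeyInputFormat.class);
// Set the reader schema so records can be decoded as MyRecord
// (with the generated class on the task classpath) rather than
// falling back to GenericData.Record instances.
AvroJob.setInputKeySchema(job, MyRecord.getClassSchema());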
The problem occurs at the very beginning of the second job's map method, which looks like this:
protected void map(AvroKey<MyRecord> key, NullWritable value, Context context) throws IOException, InterruptedException {
MyRecord myRecord = key.datum();
##### some other logic
Every time I run the second job, I get the following error:
16/07/28 18:24:38 WARN mapred.LocalJobRunner: job_local1682958846_0001
java.lang.Exception: java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot be cast to MyRecord
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Caused by: java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot be cast to MyRecord
at your.class.path$StatsCalculatorMapper.map(YourSecondJob.java:150)
at your.class.path$StatsCalculatorMapper.map(YourSecondJob.java:110)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
Answer 0 (score: 0)
It looks like the problem stemmed from my testing in a local pseudo-distributed environment, where the correct Avro version specified in my pom.xml was not being picked up. Instead, an older version of Avro with this bug was being pulled in without my realizing it. Once I ran the same program on EMR, it worked fine, because the correct version of Avro was in use.
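If anyone else hits this symptom, two checks worth trying (sketches of my own, not from the original answer): print which Avro jar the task JVM actually loaded, and, on Hadoop 2.x, ask the framework to prefer the user's jars over its bundled ones:

// Log the jar that org.apache.avro.Schema was loaded from; an old
// cluster-bundled avro jar shadowing the pom.xml version shows up here.
System.out.println(org.apache.avro.Schema.class
        .getProtectionDomain().getCodeSource().getLocation());

// In the driver: prefer user jars over Hadoop's bundled ones.
job.getConfiguration().setBoolean("mapreduce.job.user.classpath.first", true);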