I have an Avro schema, for example:
{
  "type": "record",
  "namespace": "quotes",
  "name": "Quotes",
  "fields": [
    {
      "name": "instrument",
      "type": "string"
    },
    {
      "name": "providerSentTime",
      "type": "long"
    },
    {
      "name": "bids",
      "type": {
        "type": "array",
        "items": {
          "name": "BidQuote",
          "type": "record",
          "fields": [
            {
              "name": "rate",
              "type": "double"
            },
            {
              "name": "liquidity",
              "type": "double"
            },
            {
              "name": "time",
              "type": "long"
            },
            {
              "name": "status",
              "type": "int"
            }
          ]
        }
      }
    },
    {
      "name": "asks",
      "type": {
        "type": "array",
        "items": {
          "name": "AskQuote",
          "type": "record",
          "fields": [
            {
              "name": "rate",
              "type": "double"
            },
            {
              "name": "liquidity",
              "type": "double"
            },
            {
              "name": "time",
              "type": "long"
            },
            {
              "name": "status",
              "type": "int"
            }
          ]
        }
      }
    }
  ]
}
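Since the deserialize method shown below reads the raw bytes with a plain BinaryDecoder, the file appears to be a stream of concatenated binary-encoded Quotes records with no Avro container-file header. A minimal writer sketch for such a file (the serialize method, output handling and the quotes list are placeholders here, not my actual producer) might look like:

// Hedged sketch, not the actual producer: writes Quotes records one after another
// with a raw BinaryEncoder (no Avro container-file header), which is the layout
// the deserialize method below expects.
// Uses org.apache.avro.io.{EncoderFactory, BinaryEncoder} and
// org.apache.avro.specific.SpecificDatumWriter.
public static void serialize(List<Quotes> quotes, String path) throws IOException {
    SpecificDatumWriter<Quotes> quoteDatumWriter = new SpecificDatumWriter<>(Quotes.class);
    try (OutputStream out = new FileOutputStream(path)) {
        BinaryEncoder binaryEncoder = EncoderFactory.get().binaryEncoder(out, null);
        for (Quotes quote : quotes) {
            // records are written back to back, with no framing between them
            quoteDatumWriter.write(quote, binaryEncoder);
        }
        binaryEncoder.flush();
    }
}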
When I want to deserialize it, I use a custom Java method like this:
public static List<Quotes> deserialize(byte[] avroBytes)
        throws IOException, InvocationTargetException {
    ByteArrayInputStream byteArrayInputStream = new ByteArrayInputStream(avroBytes);
    byteArrayInputStream.reset();
    BinaryDecoder binaryDecoder = DecoderFactory.get().binaryDecoder(byteArrayInputStream, null);
    SpecificDatumReader<Quotes> quoteDatumReader = new SpecificDatumReader<>(Quotes.class);
    List<Quotes> sparkQuotes = new ArrayList<>();
    while (!binaryDecoder.isEnd()) {
        Quotes read = quoteDatumReader.read(null, binaryDecoder);
        sparkQuotes.add(read);
    }
    return sparkQuotes;
}
This works fine.
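A minimal local call, reading the whole file into one byte[] first, looks like this (a sketch; the path is assumed to be the same Quotes.avro file used in the Spark job below):

// Minimal local call: the whole file is read into a single byte[] up front,
// so the decoder sees every record from start to end.
byte[] avroBytes = Files.readAllBytes(Paths.get("Quotes.avro"));  // java.nio.file
List<Quotes> quotes = deserialize(avroBytes);
System.out.println(quotes.size());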
The problem occurs when I try to deserialize the same file inside a Spark job, for example with the following code:
// I need the hadoopFile method in my context, for reasons not shown in this snippet.
JavaPairRDD<Text, Text> textTextJavaPairRDD =
        javaSparkContext.hadoopFile(
                "Quotes.avro",
                KeyValueTextInputFormat.class, Text.class, Text.class);

JavaRDD<byte[]> avroByteRDD = textTextJavaPairRDD.map(new Function<Tuple2<Text, Text>, byte[]>() {
    @Override
    public byte[] call(Tuple2<Text, Text> hdfsTextTextRow) throws Exception {
        return hdfsTextTextRow._1.copyBytes();
    }
});

JavaRDD<Quotes> quotesRDD = avroByteRDD.flatMap(new FlatMapFunction<byte[], Quotes>() {
    @Override
    public Iterable<Quotes> call(byte[] keyAvroByte) throws Exception {
        return deserialize(keyAvroByte);
    }
});

quotesRDD.collect();
When I run this, I get the exception:
WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.io.EOFException

raised by the call to the Avro read method inside my deserialize method.
Why does this work outside of Spark execution, but fail on the same file when I try to do the same thing inside Spark?
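In case it helps narrow this down, a purely diagnostic sketch like the one below (path and variable names are my placeholders) would compare the total number of key bytes Spark hands to the map function against the size of the file read locally:

// Diagnostic sketch only, not part of the failing job. Compares the bytes
// delivered through KeyValueTextInputFormat with the raw file length.
long localLength = Files.readAllBytes(Paths.get("Quotes.avro")).length;  // java.nio.file

long sparkKeyBytes = textTextJavaPairRDD
        .map(new Function<Tuple2<Text, Text>, Long>() {
            @Override
            public Long call(Tuple2<Text, Text> row) throws Exception {
                // same copyBytes() on the key as in the failing job
                return (long) row._1.copyBytes().length;
            }
        })
        .reduce(new Function2<Long, Long, Long>() {  // org.apache.spark.api.java.function.Function2
            @Override
            public Long call(Long a, Long b) throws Exception {
                return a + b;
            }
        });

System.out.println("local file bytes: " + localLength + ", key bytes seen by Spark: " + sparkKeyBytes);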