I have an Avro schema, for example:
{
  "type": "record",
  "namespace": "quotes",
  "name": "Quotes",
  "fields": [
    {
      "name": "instrument",
      "type": "string"
    },
    {
      "name": "providerSentTime",
      "type": "long"
    },
    {
      "name": "bids",
      "type": {
        "type": "array",
        "items": {
          "name": "BidQuote",
          "type": "record",
          "fields": [
            {
              "name": "rate",
              "type": "double"
            },
            {
              "name": "liquidity",
              "type": "double"
            },
            {
              "name": "time",
              "type": "long"
            },
            {
              "name": "status",
              "type": "int"
            }
          ]
        }
      }
    },
    {
      "name": "asks",
      "type": {
        "type": "array",
        "items": {
          "name": "AskQuote",
          "type": "record",
          "fields": [
            {
              "name": "rate",
              "type": "double"
            },
            {
              "name": "liquidity",
              "type": "double"
            },
            {
              "name": "time",
              "type": "long"
            },
            {
              "name": "status",
              "type": "int"
            }
          ]
        }
      }
    }
  ]
}
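Since the deserialize method shown below reads the raw bytes with a plain BinaryDecoder, the file appears to be a stream of concatenated binary-encoded Quotes records with no Avro container-file header. A minimal writer sketch for such a file (the serialize method, output handling and the quotes list are placeholders here, not my actual producer) might look like:

// Hedged sketch, not the actual producer: writes Quotes records one after another
// with a raw BinaryEncoder (no Avro container-file header), which is the layout
// the deserialize method below expects.
// Uses org.apache.avro.io.{EncoderFactory, BinaryEncoder} and
// org.apache.avro.specific.SpecificDatumWriter.
public static void serialize(List<Quotes> quotes, String path) throws IOException {
    SpecificDatumWriter<Quotes> quoteDatumWriter = new SpecificDatumWriter<>(Quotes.class);
    try (OutputStream out = new FileOutputStream(path)) {
        BinaryEncoder binaryEncoder = EncoderFactory.get().binaryEncoder(out, null);
        for (Quotes quote : quotes) {
            // records are written back to back, with no framing between them
            quoteDatumWriter.write(quote, binaryEncoder);
        }
        binaryEncoder.flush();
    }
}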
When I want to deserialize it, I use a custom Java method like this:
public static List<Quotes> deserialize(byte[] avroBytes)
        throws IOException, InvocationTargetException {
    ByteArrayInputStream byteArrayInputStream = new ByteArrayInputStream(avroBytes);
    byteArrayInputStream.reset();
    BinaryDecoder binaryDecoder = DecoderFactory.get().binaryDecoder(byteArrayInputStream, null);
    SpecificDatumReader<Quotes> quoteDatumReader = new SpecificDatumReader<>(Quotes.class);
    List<Quotes> sparkQuotes = new ArrayList<>();
    while (!binaryDecoder.isEnd()) {
        Quotes read = quoteDatumReader.read(null, binaryDecoder);
        sparkQuotes.add(read);
    }
    return sparkQuotes;
}
This works fine.
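A minimal local call, reading the whole file into one byte[] first, looks like this (a sketch; the path is assumed to be the same Quotes.avro file used in the Spark job below):

// Minimal local call: the whole file is read into a single byte[] up front,
// so the decoder sees every record from start to end.
byte[] avroBytes = Files.readAllBytes(Paths.get("Quotes.avro"));  // java.nio.file
List<Quotes> quotes = deserialize(avroBytes);
System.out.println(quotes.size());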
The problem occurs when I try to deserialize the same file inside a Spark job, for example with the following code:
// I need the hadoopFile method in my context, for reasons not shown in this snippet.
JavaPairRDD<Text, Text> textTextJavaPairRDD =
        javaSparkContext.hadoopFile(
                "Quotes.avro",
                KeyValueTextInputFormat.class, Text.class, Text.class);

JavaRDD<byte[]> avroByteRDD = textTextJavaPairRDD.map(new Function<Tuple2<Text, Text>, byte[]>() {
    @Override
    public byte[] call(Tuple2<Text, Text> hdfsTextTextRow) throws Exception {
        return hdfsTextTextRow._1.copyBytes();
    }
});

JavaRDD<Quotes> quotesRDD = avroByteRDD.flatMap(new FlatMapFunction<byte[], Quotes>() {
    @Override
    public Iterable<Quotes> call(byte[] keyAvroByte) throws Exception {
        return deserialize(keyAvroByte);
    }
});

quotesRDD.collect();
When I run this, I get the exception:
WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.io.EOFException

raised by the call to the Avro read method inside my deserialize method.
Why does this work outside of Spark execution, but fail on the same file when I try to do the same thing inside Spark?
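In case it helps narrow this down, a purely diagnostic sketch like the one below (path and variable names are my placeholders) would compare the total number of key bytes Spark hands to the map function against the size of the file read locally:

// Diagnostic sketch only, not part of the failing job. Compares the bytes
// delivered through KeyValueTextInputFormat with the raw file length.
long localLength = Files.readAllBytes(Paths.get("Quotes.avro")).length;  // java.nio.file

long sparkKeyBytes = textTextJavaPairRDD
        .map(new Function<Tuple2<Text, Text>, Long>() {
            @Override
            public Long call(Tuple2<Text, Text> row) throws Exception {
                // same copyBytes() on the key as in the failing job
                return (long) row._1.copyBytes().length;
            }
        })
        .reduce(new Function2<Long, Long, Long>() {  // org.apache.spark.api.java.function.Function2
            @Override
            public Long call(Long a, Long b) throws Exception {
                return a + b;
            }
        });

System.out.println("local file bytes: " + localLength + ", key bytes seen by Spark: " + sparkKeyBytes);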