Question

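I want to read an HDFS folder containing Avro files with Spark, and then deserialize the Avro events contained in those files. I would like to do this without the com.databricks library (or any other library that makes it easy).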

The problem is that I am having difficulties with the deserialization.

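I think my Avro files are compressed with Snappy because, at the beginning of the file (just after the schema), I see

avro.codecsnappy

written, followed by characters that are partly readable and partly unreadable.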

My first attempt at deserializing the Avro event was the following:

public static String deserialize(String message) throws IOException {
    Schema.Parser schemaParser = new Schema.Parser();
    Schema avroSchema = schemaParser.parse(defaultFlumeAvroSchema);

    DatumReader<GenericRecord> specificDatumReader = new SpecificDatumReader<GenericRecord>(avroSchema);

    byte[] messageBytes = message.getBytes();
    Decoder decoder = DecoderFactory.get().binaryDecoder(messageBytes, null);
    GenericRecord genericRecord = specificDatumReader.read(null, decoder);

    return genericRecord.toString();
}

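This function works when I deserialize an Avro file that does not have avro.codecsnappy in it. When the file does have it, I get the error:

Malformed data. Length is negative: -50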

So I tried another approach:

private static void deserialize2(String path) throws IOException {
    DatumReader<GenericRecord> reader = new GenericDatumReader<>();
    DataFileReader<GenericRecord> fileReader =
            new DataFileReader<>(new File(path), reader);
    System.out.println(fileReader.getSchema().toString());

    GenericRecord record = new GenericData.Record(fileReader.getSchema());

    int numEvents = 0;
    while (fileReader.hasNext()) {
        fileReader.next(record);
        ByteBuffer body = (ByteBuffer) record.get("body");
        CharsetDecoder decoder = Charsets.UTF_8.newDecoder();
        System.out.println("Position of the index " + body.position());
        System.out.println("Size of the array : " + body.array().length);
        String bodyStr = decoder.decode(body).toString();
        System.out.println("THE BODY STRING  ---> " + bodyStr);
        numEvents++;
    }
    fileReader.close();
}

It returns the following output:

Position of the index 0

Size of the array : 127482

THE BODY STRING  --->

I can see that the array is not empty, but it just returns an empty string.

How can I proceed?

Answer 1

Use this when converting to a string:

String bodyStr = new String(body.array());
System.out.println("THE BODY STRING  ---> " + bodyStr);

Source: https://www.mkyong.com/java/how-do-convert-byte-array-to-string-in-java/
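
One caveat with `body.array()`: it returns the whole backing array and ignores the buffer's position and limit, and `new String(byte[])` uses the platform default charset. The sketch below uses made-up data and a hypothetical class `BackingArrayDemo` (neither comes from the question) to show the difference, plus a variant that decodes only the readable window. It assumes an array-backed buffer holding UTF-8 text.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class BackingArrayDemo {
    // Decode only the bytes between position and limit of an array-backed buffer.
    // Assumption: the content is UTF-8 text.
    static String decodeWindow(ByteBuffer body) {
        return new String(body.array(),
                body.arrayOffset() + body.position(),
                body.remaining(),
                StandardCharsets.UTF_8);
    }

    // Decode the entire backing array, as new String(body.array()) would,
    // but with an explicit charset.
    static String decodeWholeArray(ByteBuffer body) {
        return new String(body.array(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Illustrative data: the buffer exposes only "hello" out of a larger array.
        byte[] backing = "xxhelloxx".getBytes(StandardCharsets.UTF_8);
        ByteBuffer body = ByteBuffer.wrap(backing, 2, 5);

        System.out.println(decodeWholeArray(body)); // xxhelloxx
        System.out.println(decodeWindow(body));     // hello
    }
}
```

If the buffer happens to span its whole backing array, both variants give the same result; they differ only when the buffer is a window into a larger array.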

Answer 2

Well, it seems you are on the right track. However, your ByteBuffer may not have a proper byte[] array to decode, so let's try the following instead:

byte[] bytes = new byte[body.remaining()];
body.get(bytes); // copies the remaining bytes and advances the buffer's position
String result = new String(bytes, "UTF-8"); // Maybe you need to change charset

This should work. As the question shows, the ByteBuffer does contain actual data; as noted in the code sample, you may have to change the charset.
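
Note that a relative `get` advances the buffer's position, so reading the same buffer a second time would find nothing left. A minimal sketch of a helper that avoids this by reading through `duplicate()` follows; the class name `BodyDecoder`, the method name `bodyToString`, and the sample text are illustrative assumptions, not part of the original code.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BodyDecoder {
    // Copy the readable bytes into a String without moving the caller's position.
    static String bodyToString(ByteBuffer body, Charset charset) {
        // duplicate() shares the content but has its own position and limit
        ByteBuffer view = body.duplicate();
        byte[] bytes = new byte[view.remaining()];
        view.get(bytes);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Illustrative data standing in for record.get("body")
        ByteBuffer body = ByteBuffer.wrap("flume event".getBytes(StandardCharsets.UTF_8));

        System.out.println(bodyToString(body, StandardCharsets.UTF_8)); // flume event
        System.out.println(body.remaining()); // 11, the original buffer is untouched
    }
}
```

Passing the charset as a parameter keeps the "maybe you need to change charset" decision in one place.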

List of charsets: https://docs.oracle.com/javase/7/docs/api/java/nio/charset/Charset.html

Also useful: https://docs.oracle.com/javase/7/docs/api/java/nio/ByteBuffer.html
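
On the "length is negative" error from the first attempt in the question: Avro's binary encoding writes lengths as zig-zag variable-length longs, so a binary decoder pointed at bytes that are not a bare serialized datum (for example a container file header or a compressed block) can end up reading an arbitrary byte as a negative length. Passing binary data through a `String` can also alter the bytes. The sketch below uses only the JDK to illustrate both effects. `decodeZigZag` is a hypothetical helper for single-byte values, not part of the Avro API, and the byte 0x63 is just an example that decodes to -50, not necessarily the byte involved in the question.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class NegativeLengthDemo {
    // Zig-zag decode of a value that fits in a single byte (high bit clear).
    // Odd inputs map to negative numbers, even inputs to non-negative ones.
    static long decodeZigZag(int singleByte) {
        return (singleByte >>> 1) ^ -(singleByte & 1);
    }

    // Returns true if the bytes are unchanged after bytes -> String -> bytes.
    static boolean survivesStringRoundTrip(byte[] data) {
        byte[] back = new String(data, StandardCharsets.UTF_8)
                .getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(data, back);
    }

    public static void main(String[] args) {
        System.out.println(decodeZigZag(0x63)); // -50
        System.out.println(decodeZigZag(0x02)); // 1

        // 0xFF is not valid UTF-8, so it is replaced during decoding
        byte[] binary = {(byte) 0xFF, 0x00, (byte) 0x9C};
        System.out.println(survivesStringRoundTrip(binary)); // false
        System.out.println(survivesStringRoundTrip(
                "plain text".getBytes(StandardCharsets.UTF_8))); // true
    }
}
```

This is why the second approach in the question, which hands the file to `DataFileReader` rather than converting it to a `String` first, gets as far as reading records.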

How to deserialize an Avro file
