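Writing protobuf objects to Parquet using Apache Beam

Time: 2018-09-18 09:04:50

Tags: protocol-buffers parquet apache-beam

I am reading protobuf data from Google Pub/Sub and deserializing it into objects of type Message, so I end up with a PCollection<Message>. Here is the sample code: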

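// Imports and the LOG field were missing from the snippet; SLF4J is assumed for logging.
import com.google.protobuf.InvalidProtocolBufferException;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.transforms.DoFn;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ProcessPubsubMessage extends DoFn<PubsubMessage, Message> {

    private static final Logger LOG = LoggerFactory.getLogger(ProcessPubsubMessage.class);
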
    @ProcessElement
    public void processElement(@Element PubsubMessage element, OutputReceiver<Message> receiver) {

        byte[] payload = element.getPayload();
        try {
            Message message = Message.parseFrom(payload);
            receiver.output(message);
        } catch (InvalidProtocolBufferException e) {
            LOG.error("Got exception while parsing message from pubsub. Exception =>" + e.getMessage());
        }

    }
}
PCollection<Message> event = psMessage.apply("Parsing data from pubsub message",
                ParDo.of(new ProcessPubsubMessage()));

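I want to apply a transform to the PCollection<Message> event so that it is written in Parquet format. I know Apache Beam already provides ParquetIO, but it works on a PCollection<GenericRecord>. Converting from Message to GenericRecord might solve the problem, but I don't know how to do that yet. Is there a simple way to write the data in Parquet format?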

1 answer:

Answer 0: (score: 0)

This can be solved with the following library:

<dependency>
     <groupId>org.apache.avro</groupId>
     <artifactId>avro-protobuf</artifactId>
     <version>1.7.7</version>
</dependency>

// Event is the generated protobuf class (presumably the class called Message in the question).
private GenericRecord getGenericRecord(Event event) throws IOException {
    // Step 1: serialize the protobuf object to Avro binary, using the Avro schema derived from Event.
    ProtobufDatumWriter<Event> datumWriter = new ProtobufDatumWriter<Event>(Event.class);
    ByteArrayOutputStream os = new ByteArrayOutputStream();
    Encoder e = EncoderFactory.get().binaryEncoder(os, null);
    datumWriter.write(event, e);
    e.flush();

    // Step 2: read the same bytes back as a GenericRecord with the protobuf-derived schema.
    ProtobufDatumReader<Event> datumReader = new ProtobufDatumReader<Event>(Event.class);
    GenericDatumReader<GenericRecord> genericDatumReader = new GenericDatumReader<GenericRecord>(datumReader.getSchema());
    GenericRecord record = genericDatumReader.read(null, DecoderFactory.get().binaryDecoder(new ByteArrayInputStream(os.toByteArray()), null));
    return record;
}

For more details, see: https://gist.github.com/alexvictoor/1d3937f502c60318071f
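
The answer stops at the conversion, so here is a rough sketch of how it could be wired into the pipeline from the question. It is not part of the original answer. The names ProtoToParquet, ToGenericRecord, convert, writeParquet and outputDir are made up for illustration, and the 5-minute window and single shard are arbitrary choices for a streaming source. It assumes avro-protobuf and beam-sdks-java-io-parquet are on the classpath, and a Beam 2.x release where AvroCoder lives in org.apache.beam.sdk.coders (newer releases ship it in the Avro extension package instead).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

import com.google.protobuf.Message;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.protobuf.ProtobufData;
import org.apache.avro.protobuf.ProtobufDatumWriter;
import org.apache.beam.sdk.coders.AvroCoder;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class ProtoToParquet {

    /** Same round trip as getGenericRecord above: protobuf -> Avro binary -> GenericRecord. */
    public static <T extends Message> GenericRecord convert(
            T message, ProtobufDatumWriter<T> writer, GenericDatumReader<GenericRecord> reader) {
        try {
            ByteArrayOutputStream os = new ByteArrayOutputStream();
            BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(os, null);
            writer.write(message, encoder);
            encoder.flush();
            return reader.read(null, DecoderFactory.get().binaryDecoder(os.toByteArray(), null));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** DoFn that converts each protobuf element; writer and reader are rebuilt on each worker. */
    static class ToGenericRecord<T extends Message> extends DoFn<T, GenericRecord> {
        private final Class<T> protoClass;
        private transient ProtobufDatumWriter<T> writer;
        private transient GenericDatumReader<GenericRecord> reader;

        ToGenericRecord(Class<T> protoClass) {
            this.protoClass = protoClass;
        }

        @Setup
        public void setup() {
            writer = new ProtobufDatumWriter<T>(protoClass);
            reader = new GenericDatumReader<GenericRecord>(ProtobufData.get().getSchema(protoClass));
        }

        @ProcessElement
        public void processElement(ProcessContext c) {
            c.output(convert(c.element(), writer, reader));
        }
    }

    /** Windows the (possibly unbounded) input, converts it, and writes Parquet files. */
    public static <T extends Message> void writeParquet(
            PCollection<T> events, Class<T> protoClass, String outputDir) {
        // Avro schema derived from the protobuf descriptor; used by both the coder and the sink.
        Schema schema = ProtobufData.get().getSchema(protoClass);
        events
            .apply("Window", Window.<T>into(FixedWindows.of(Duration.standardMinutes(5))))
            .apply("ToGenericRecord", ParDo.of(new ToGenericRecord<T>(protoClass)))
            .setCoder(AvroCoder.of(schema))
            .apply("WriteParquet", FileIO.<GenericRecord>write()
                .via(ParquetIO.sink(schema))
                .to(outputDir)
                .withSuffix(".parquet")
                .withNumShards(1));
    }
}
```

With the question's pipeline this would be called as something like ProtoToParquet.writeParquet(event, Message.class, outputDir), where Message is the generated protobuf class and outputDir is whatever destination path you choose. The windowing step is there because Pub/Sub is an unbounded source and file writes need windowed input in that case; for a bounded PCollection it could probably be dropped.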