How to deserialize Avro data using Apache Beam (KafkaIO)

Date: 2019-09-13 08:14:04

Tags: java apache-kafka avro apache-beam confluent-schema-registry

I've found only one thread with information on the topic I'm describing: How to Deserialising Kafka AVRO messages using Apache Beam

However, after trying several Kafka deserializers, I still cannot deserialize the Kafka messages. Here is my code:

import java.io.IOException;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Keys;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PBegin;
import org.apache.beam.sdk.values.PCollection;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.google.common.collect.ImmutableMap;

public class Readkafka {
    private static final Logger LOG = LoggerFactory.getLogger(Readkafka.class);

    public static void main(String[] args) throws IOException {
        // Create the Pipeline object with the options we defined above.
        Pipeline p = Pipeline.create(
                PipelineOptionsFactory.fromArgs(args).withValidation().create());

        // Read a bounded sample of 5 records, keyed by the Avro-generated class.
        PTransform<PBegin, PCollection<KV<action_states_pkey, String>>> kafka =
                KafkaIO.<action_states_pkey, String>read()
                    .withBootstrapServers("mybootstrapserver")
                    .withTopic("action_States")
                    .withKeyDeserializer(MyClassKafkaAvroDeserializer.class)
                    .withValueDeserializer(StringDeserializer.class)
                    .updateConsumerProperties(ImmutableMap.of("schema.registry.url", (Object) "schemaregistryurl"))
                    .withMaxNumRecords(5)
                    .withoutMetadata();

        p.apply(kafka)
            .apply(Keys.<action_states_pkey>create());
    }
}

where MyClassKafkaAvroDeserializer is:

import java.util.Map;

import org.apache.kafka.common.serialization.Deserializer;

import io.confluent.kafka.serializers.AbstractKafkaAvroDeserializer;
import io.confluent.kafka.serializers.KafkaAvroDeserializerConfig;

public class MyClassKafkaAvroDeserializer
        extends AbstractKafkaAvroDeserializer
        implements Deserializer<action_states_pkey> {

    @Override
    public void configure(Map<String, ?> configs, boolean isKey) {
        configure(new KafkaAvroDeserializerConfig(configs));
    }

    @Override
    public action_states_pkey deserialize(String s, byte[] bytes) {
        // Cast the decoded record to the Avro-generated class.
        return (action_states_pkey) this.deserialize(bytes);
    }

    @Override
    public void close() {}
}

and the action_states_pkey class is code generated from the Avro schema with avro-tools:

java -jar pathtoavrotools/avro-tools-1.8.1.jar compile schema pathtoschema/action_states_pkey.avsc destination path

where action_states_pkey.avsc is:

{"type":"record","name":"action_states_pkey","namespace":"namespace","fields":[{"name":"ad_id","type":["null","int"]},{"name":"action_id","type":["null","int"]},{"name":"state_id","type":["null","int"]}]}

Using this code, I get the error:

Caused by: java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot be cast to my.mudah.beam.test.action_states_pkey
    at my.mudah.beam.test.MyClassKafkaAvroDeserializer.deserialize(MyClassKafkaAvroDeserializer.java:20)
    at my.mudah.beam.test.MyClassKafkaAvroDeserializer.deserialize(MyClassKafkaAvroDeserializer.java:1)
    at org.apache.beam.sdk.io.kafka.KafkaUnboundedReader.advance(KafkaUnboundedReader.java:221)
    at org.apache.beam.sdk.io.BoundedReadFromUnboundedSource$UnboundedToBoundedSourceAdapter$Reader.advanceWithBackoff(BoundedReadFromUnboundedSource.java:279)
    at org.apache.beam.sdk.io.BoundedReadFromUnboundedSource$UnboundedToBoundedSourceAdapter$Reader.start(BoundedReadFromUnboundedSource.java:256)
    at com.google.cloud.dataflow.worker.WorkerCustomSources$BoundedReaderIterator.start(WorkerCustomSources.java:592)
    ... 14 more

It seems something goes wrong when trying to map the Avro data onto my custom class?

Alternatively, I tried the following code:

        PTransform<PBegin, PCollection<KV<action_states_pkey, String>>> kafka =
                KafkaIO.<action_states_pkey, String>read()
                    .withBootstrapServers("bootstrapserver")
                    .withTopic("action_states")
                    .withKeyDeserializerAndCoder((Class) KafkaAvroDeserializer.class, AvroCoder.of(action_states_pkey.class))
                    .withValueDeserializer(StringDeserializer.class)
                    .updateConsumerProperties(ImmutableMap.of("schema.registry.url", (Object) "schemaregistry"))
                    .withMaxNumRecords(5)
                    .withoutMetadata();

        p.apply(kafka)
            .apply(Keys.<action_states_pkey>create());
//            .apply("ExtractWords", ParDo.of(new DoFn<action_states_pkey, String>() {
//                @ProcessElement
//                public void processElement(ProcessContext c) {
//                    action_states_pkey key = c.element();
//                    c.output(key.getAdId().toString());
//                }
//            }));

This doesn't give me any errors until I try to print the data. Since I have to verify one way or another that I'm actually reading the data, my intent was to log it to the console. As soon as I uncomment that section, I get the same error again:

SEVERE: 2019-09-13T07:53:56.168Z: java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot be cast to my.mudah.beam.test.action_states_pkey
    at my.mudah.beam.test.Readkafka$1.processElement(Readkafka.java:151)

Another thing to note is that if I specify:

.updateConsumerProperties(ImmutableMap.of("specific.avro.reader", (Object)"true"))

it always gives me the error:

Caused by: org.apache.kafka.common.errors.SerializationException: Error deserializing Avro message for id 443
Caused by: org.apache.kafka.common.errors.SerializationException: Could not find class NAMESPACE.action_states_pkey specified in writer's schema whilst finding reader's schema for a SpecificRecord.

Does it seem like there's something wrong with my approach? If anyone has experience reading AVRO data from Kafka streams into Apache Beam, please help me out. I really appreciate it.

Here is a snapshot of my package, which also contains the schema and the class: package/working path details

Thanks.

1 Answer:

Answer 0 (score: 0)

public class MyClassKafkaAvroDeserializer extends AbstractKafkaAvroDeserializer

Your class is extending AbstractKafkaAvroDeserializer, which returns a GenericRecord.

You need to convert the GenericRecord to your custom object.
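
For example, the deserialize method in MyClassKafkaAvroDeserializer could copy the fields across instead of casting. This is only a minimal sketch, not code from the original answer; it assumes avro-tools generated the usual newBuilder()/setAdId()/setActionId()/setStateId() builder API for action_states_pkey, matching the schema in the question, and it uses org.apache.avro.generic.GenericRecord:

@Override
public action_states_pkey deserialize(String s, byte[] bytes) {
    // AbstractKafkaAvroDeserializer yields a GenericRecord; copy its fields
    // into the generated class instead of casting it directly.
    GenericRecord record = (GenericRecord) this.deserialize(bytes);
    if (record == null) {
        return null;
    }
    // Field names follow action_states_pkey.avsc; the builder methods are
    // assumed to match avro-tools' usual generated output.
    return action_states_pkey.newBuilder()
            .setAdId((Integer) record.get("ad_id"))
            .setActionId((Integer) record.get("action_id"))
            .setStateId((Integer) record.get("state_id"))
            .build();
}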

OR

use SpecificRecord for this, as described in one of the answers linked below:

/**
 * Extends deserializer to support ReflectData.
 *
 * @param <V>
 *     value type
 */
public abstract class ReflectKafkaAvroDeserializer<V> extends KafkaAvroDeserializer {

  private Schema readerSchema;
  private DecoderFactory decoderFactory = DecoderFactory.get();

  protected ReflectKafkaAvroDeserializer(Class<V> type) {
    readerSchema = ReflectData.get().getSchema(type);
  }

  @Override
  protected Object deserialize(
      boolean includeSchemaAndVersion,
      String topic,
      Boolean isKey,
      byte[] payload,
      Schema readerSchemaIgnored) throws SerializationException {

    if (payload == null) {
      return null;
    }

    int schemaId = -1;
    try {
      ByteBuffer buffer = ByteBuffer.wrap(payload);
      if (buffer.get() != MAGIC_BYTE) {
        throw new SerializationException("Unknown magic byte!");
      }

      schemaId = buffer.getInt();
      Schema writerSchema = schemaRegistry.getByID(schemaId);

      int start = buffer.position() + buffer.arrayOffset();
      int length = buffer.limit() - 1 - idSize;
      DatumReader<Object> reader = new ReflectDatumReader(writerSchema, readerSchema);
      BinaryDecoder decoder = decoderFactory.binaryDecoder(buffer.array(), start, length, null);
      return reader.read(null, decoder);
    } catch (IOException e) {
      throw new SerializationException("Error deserializing Avro message for id " + schemaId, e);
    } catch (RestClientException e) {
      throw new SerializationException("Error retrieving Avro schema for id " + schemaId, e);
    }
  }
}

The above is copied from https://stackoverflow.com/a/39617120/2534090

See also: https://stackoverflow.com/a/42514352/2534090
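
For completeness, a sketch of how this could be wired into the question's pipeline. The subclass name ActionStatesPkeyDeserializer is illustrative (it does not come from either linked answer), and the raw (Class) cast mirrors the one the question already uses, since the Confluent deserializers are typed on Object rather than on action_states_pkey:

// Hypothetical concrete subclass binding the reflect-based deserializer
// to the generated key class.
public class ActionStatesPkeyDeserializer
        extends ReflectKafkaAvroDeserializer<action_states_pkey> {
    public ActionStatesPkeyDeserializer() {
        super(action_states_pkey.class);
    }
}

and then in the pipeline:

PTransform<PBegin, PCollection<KV<action_states_pkey, String>>> kafka =
        KafkaIO.<action_states_pkey, String>read()
            .withBootstrapServers("bootstrapserver")
            .withTopic("action_states")
            .withKeyDeserializerAndCoder((Class) ActionStatesPkeyDeserializer.class,
                    AvroCoder.of(action_states_pkey.class))
            .withValueDeserializer(StringDeserializer.class)
            .updateConsumerProperties(ImmutableMap.of("schema.registry.url", (Object) "schemaregistry"))
            .withMaxNumRecords(5)
            .withoutMetadata();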