Apache Beam reading from Kafka gives CoderException: java.io.EOFException

Asked: 2017-04-04 14:12:34

Tags: apache-kafka google-cloud-dataflow apache-beam

I've implemented a Beam pipeline that reads from Kafka, based on the documentation here: https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L125

The pipeline itself works fine with bounded sources; I have test cases that read from files without any problems.

The code that reads from Kafka is very simple and essentially the same as the example:

    PCollection<String> input = p.apply(KafkaIO.<Long, String>read()
                                        .withBootstrapServers(KAFKA_BROKER)
                                        .withTopics(Arrays.asList(KAFKA_READ_TOPIC))
                                        .withKeyCoder(BigEndianLongCoder.of())
                                        .withValueCoder(StringUtf8Coder.of())
                                        .withTimestampFn(new TimestampKafkaStrings())
                                        .withoutMetadata())
    .apply(Values.<String>create());

The application starts up fine and appears to connect to Kafka. However, as soon as I write to Kafka from another process and the pipeline begins to read, I get the following exception on the first read:

    INFO: Kafka version : 0.10.2.0
    Apr 04, 2017 9:46:18 AM org.apache.kafka.common.utils.AppInfoParser$AppInfo <init>
    INFO: Kafka commitId : 576d93a8dc0cf421
    Apr 04, 2017 9:46:30 AM org.apache.beam.sdk.io.kafka.KafkaIO$UnboundedKafkaReader advance
    INFO: Reader-0: first record offset 2000
    Apr 04, 2017 9:46:30 AM org.apache.beam.sdk.io.kafka.KafkaIO$UnboundedKafkaReader consumerPollLoop
    INFO: Reader-0: Returning from consumer pool loop
    [WARNING]
    java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:293)
        at java.lang.Thread.run(Thread.java:745)
    Caused by: java.lang.RuntimeException: org.apache.beam.sdk.coders.CoderException: java.io.EOFException
        at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:453)
        at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:350)
        at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:71)
        at org.apache.beam.sdk.Pipeline.run(Pipeline.java:210)
        at com.groupbyinc.beam.SessionRollup.main(SessionRollup.java:186)
        ... 6 more
    Caused by: org.apache.beam.sdk.coders.CoderException: java.io.EOFException
        at org.apache.beam.sdk.coders.BigEndianLongCoder.decode(BigEndianLongCoder.java:64)
        at org.apache.beam.sdk.coders.BigEndianLongCoder.decode(BigEndianLongCoder.java:33)
        at org.apache.beam.sdk.io.kafka.KafkaIO$UnboundedKafkaReader.decode(KafkaIO.java:1018)
        at org.apache.beam.sdk.io.kafka.KafkaIO$UnboundedKafkaReader.advance(KafkaIO.java:989)
        at org.apache.beam.runners.direct.UnboundedReadEvaluatorFactory$UnboundedReadEvaluator.startReader(UnboundedReadEvaluatorFactory.java:190)
        at org.apache.beam.runners.direct.UnboundedReadEvaluatorFactory$UnboundedReadEvaluator.processElement(UnboundedReadEvaluatorFactory.java:128)
        at org.apache.beam.runners.direct.TransformExecutor.processElements(TransformExecutor.java:139)
        at org.apache.beam.runners.direct.TransformExecutor.run(TransformExecutor.java:107)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        ... 1 more

Something seems to be wrong with the way the key coder tries to read the Kafka message keys. In the source data those keys are not explicitly set, so I assumed they default to the timestamp in Kafka(?).

Any thoughts on how to debug this further? Or resources I could look at? Working examples?

Edit: removing the .withTimestampFn() part of the pipeline has no effect. The code seems to fail before it ever reaches that point.

1 Answer:

Answer 0 (score: 2):

The answer is that the keys are not longs. It seems that by default the keys are some kind of random hash. It's strange that the Beam KafkaIO library doesn't handle the default Kafka use case out of the box.

So my theory is that when the BigEndianLongCoder tries to decode the key, it hits an EOF because a long is bigger than a char: it runs out of input before it thinks it has read enough bytes to make a long.
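To see this failure mode in isolation, here is a minimal standalone sketch (my assumption of what the coder does internally; BigEndianLongCoder reads its long via DataInputStream.readLong()): give it fewer than 8 bytes and you get the same EOFException:

    import java.io.ByteArrayInputStream;
    import java.io.DataInputStream;
    import java.io.EOFException;

    public class ShortKeyDemo {
        public static void main(String[] args) throws Exception {
            // A 3-byte key, standing in for a non-long Kafka key.
            byte[] shortKey = new byte[] { 0x01, 0x02, 0x03 };
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(shortKey));
            try {
                in.readLong(); // wants 8 bytes, runs out after 3
            } catch (EOFException e) {
                System.out.println("Same failure as in the coder: " + e);
            }
        }
    }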

So my fixed code looks like this:

    PCollection<String> input = p.apply(KafkaIO.readBytes()
                                        .withBootstrapServers(KAFKA_BROKER)
                                        .withTopics(Arrays.asList(KAFKA_READ_TOPIC))
                                        .withTimestampFn(new TimestampKafkaStrings())
                                        .withoutMetadata())
    .apply(Values.<byte[]>create())
    .apply(ParDo.of(new BytesToString()));

The important detail is calling readBytes() instead of read(), and then parsing the bytes into a string yourself.
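BytesToString isn't shown above; a minimal sketch of what I mean, assuming the payloads are plain UTF-8 (the class name matches the pipeline above, the body is an assumption):

    import java.nio.charset.StandardCharsets;
    import org.apache.beam.sdk.transforms.DoFn;

    public class BytesToString extends DoFn<byte[], String> {
        @ProcessElement
        public void processElement(ProcessContext c) {
            // Decode each raw Kafka payload as a UTF-8 string.
            c.output(new String(c.element(), StandardCharsets.UTF_8));
        }
    }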

In my case, I then ran into another problem, because the strings being read were stringified JSON from a Node process. For some reason Jackson couldn't handle the escaped JSON coming in from Kafka, so it had to be unescaped first and then parsed.
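One way to do that unescape-then-parse step, sketched with a made-up payload (the exact escaping in your data may differ): have Jackson first read the payload as a JSON string, which unescapes it, and then parse the result:

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    public class TwoStepParse {
        private static final ObjectMapper MAPPER = new ObjectMapper();

        static JsonNode parse(String escaped) throws Exception {
            // Pass 1: the payload is a JSON string literal; reading it as
            // String.class unescapes it into plain JSON text.
            String unescaped = MAPPER.readValue(escaped, String.class);
            // Pass 2: parse the unescaped text as actual JSON.
            return MAPPER.readTree(unescaped);
        }

        public static void main(String[] args) throws Exception {
            // Doubly-stringified JSON, as a Node producer might emit it.
            String fromKafka = "\"{\\\"user\\\":\\\"abc\\\",\\\"count\\\":42}\"";
            System.out.println(parse(fromKafka)); // {"user":"abc","count":42}
        }
    }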

All of this points to weaknesses in the Apache Beam KafkaIO library. The examples provided for its use are inaccurate and don't work for the simple default case. On top of that, because it's so new there are few examples of its use online, so finding a solution when you run into trouble can be challenging.

I should probably submit a pull request with better examples.