I've implemented a Beam pipeline that reads from Kafka, based on the documentation here: https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L125
The pipeline itself works with bounded sources; I have test cases that read from files with no problems.
The code that reads from Kafka is very simple and essentially the same as the example:
PCollection<String> input = p.apply(KafkaIO.<Long, String>read()
    .withBootstrapServers(KAFKA_BROKER)
    .withTopics(Arrays.asList(KAFKA_READ_TOPIC))
    .withKeyCoder(BigEndianLongCoder.of())
    .withValueCoder(StringUtf8Coder.of())
    .withTimestampFn(new TimestampKafkaStrings())
    .withoutMetadata())
    .apply(Values.<String>create());
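TimestampKafkaStrings isn't shown here; it's just a timestamp function, typed to match the <Long, String> read above. A minimal sketch with the actual extraction logic assumed (withTimestampFn takes a SerializableFunction<KV<K, V>, Instant>):

import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.values.KV;
import org.joda.time.Instant;

// Sketch only: the real function derives the event time from the message contents.
static class TimestampKafkaStrings
        implements SerializableFunction<KV<Long, String>, Instant> {
    @Override
    public Instant apply(KV<Long, String> record) {
        return Instant.now(); // placeholder: processing time instead of a parsed event time
    }
}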
The application starts up fine and appears to connect to Kafka. However, as soon as I write to Kafka from another process and the pipeline starts reading, I get the following exception on the first read:
INFO: Kafka version : 0.10.2.0
Apr 04, 2017 9:46:18 AM org.apache.kafka.common.utils.AppInfoParser$AppInfo <init>
INFO: Kafka commitId : 576d93a8dc0cf421
Apr 04, 2017 9:46:30 AM org.apache.beam.sdk.io.kafka.KafkaIO$UnboundedKafkaReader advance
INFO: Reader-0: first record offset 2000
Apr 04, 2017 9:46:30 AM org.apache.beam.sdk.io.kafka.KafkaIO$UnboundedKafkaReader consumerPollLoop
INFO: Reader-0: Returning from consumer pool loop
[WARNING]
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:293)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: org.apache.beam.sdk.coders.CoderException: java.io.EOFException
at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:453)
at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:350)
at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:71)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:210)
at com.groupbyinc.beam.SessionRollup.main(SessionRollup.java:186)
... 6 more
Caused by: org.apache.beam.sdk.coders.CoderException: java.io.EOFException
at org.apache.beam.sdk.coders.BigEndianLongCoder.decode(BigEndianLongCoder.java:64)
at org.apache.beam.sdk.coders.BigEndianLongCoder.decode(BigEndianLongCoder.java:33)
at org.apache.beam.sdk.io.kafka.KafkaIO$UnboundedKafkaReader.decode(KafkaIO.java:1018)
at org.apache.beam.sdk.io.kafka.KafkaIO$UnboundedKafkaReader.advance(KafkaIO.java:989)
at org.apache.beam.runners.direct.UnboundedReadEvaluatorFactory$UnboundedReadEvaluator.startReader(UnboundedReadEvaluatorFactory.java:190)
at org.apache.beam.runners.direct.UnboundedReadEvaluatorFactory$UnboundedReadEvaluator.processElement(UnboundedReadEvaluatorFactory.java:128)
at org.apache.beam.runners.direct.TransformExecutor.processElements(TransformExecutor.java:139)
at org.apache.beam.runners.direct.TransformExecutor.run(TransformExecutor.java:107)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
... 1 more
There seems to be something wrong with the way the key coder is trying to read the Kafka message keys. In the source data the keys are never explicitly set, so I assumed they would default to something like a timestamp in Kafka (?).
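For context, the producer side never sets a key; the Java equivalent of what the source process does would be roughly this (a sketch; broker and topic names are placeholders, and the real producer is a separate process):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KeylessProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // No key argument: the record is written with a null key.
            producer.send(new ProducerRecord<String, String>("some-topic", "{\"user\":\"abc\"}"));
        }
    }
}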
Any ideas on how to debug this further? Or resources I could look at? Working examples?
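So far my only idea has been to dump a few raw records with a plain KafkaConsumer and ByteArrayDeserializer to see what the keys actually contain. A sketch, with placeholder broker/topic/group values:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class InspectKeys {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "debug-inspect");           // placeholder
        props.put("auto.offset.reset", "earliest");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("some-topic")); // placeholder
            ConsumerRecords<byte[], byte[]> records = consumer.poll(5000);
            for (ConsumerRecord<byte[], byte[]> rec : records) {
                int keyLen = rec.key() == null ? -1 : rec.key().length;
                System.out.println("offset=" + rec.offset() + " key bytes=" + keyLen);
            }
        }
    }
}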
EDIT: Removing the .withTimestampFn() part of the pipeline has no effect. The code seems to fail before it ever reaches that point.
Answer 0 (score: 2):
The answer is that the keys are not longs. It seems that by default the keys are random hashes. The strange thing is that the Beam KafkaIO library doesn't handle this default Kafka use case out of the box.
So my theory is that when the BigEndianLongCoder tries to decode the key, it hits the EOF: a long is bigger than a char, so it runs out of input before it thinks it has read something long enough.
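The mechanics are easy to reproduce without Beam or Kafka: BigEndianLongCoder effectively does a readLong() over the key bytes, and a long needs eight of them. A minimal sketch (the key contents here are made up):

import java.io.ByteArrayInputStream;
import java.io.DataInputStream;

public class EofDemo {
    public static void main(String[] args) throws Exception {
        byte[] shortKey = {'a', 'b', 'c'}; // hypothetical 3-byte key
        // readLong() needs 8 bytes; the stream ends first and throws EOFException,
        // which KafkaIO then surfaces as the CoderException seen in the stack trace.
        new DataInputStream(new ByteArrayInputStream(shortKey)).readLong();
    }
}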
So my fixed code looks like this:

PCollection<String> input = p.apply(KafkaIO.readBytes()
    .withBootstrapServers(KAFKA_BROKER)
    .withTopics(Arrays.asList(KAFKA_READ_TOPIC))
    .withTimestampFn(new TimestampKafkaStrings())
    .withoutMetadata())
    .apply(Values.<byte[]>create())
    .apply(ParDo.of(new BytesToString()));

The important detail is to call readBytes() instead of read(), and then parse the bytes into a String yourself.
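BytesToString isn't spelled out above; it's a trivial DoFn. A minimal sketch, assuming UTF-8 payloads:

import java.nio.charset.StandardCharsets;
import org.apache.beam.sdk.transforms.DoFn;

// Converts each raw Kafka payload into a String.
static class BytesToString extends DoFn<byte[], String> {
    @ProcessElement
    public void processElement(ProcessContext c) {
        c.output(new String(c.element(), StandardCharsets.UTF_8));
    }
}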
In my case, I then ran into another problem because the strings being read were stringified JSON from a Node process. For some reason Jackson couldn't handle the escaped JSON coming in from Kafka, so it had to be unescaped first and then parsed.
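A sketch of that two-pass parse (the helper name and shape are mine, not from the original pipeline):

import java.io.IOException;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

// The payload arrives as a JSON string literal that itself contains JSON,
// e.g. the raw text "{\"user\":\"abc\"}" including the outer quotes.
static JsonNode parseEscapedJson(ObjectMapper mapper, String raw) throws IOException {
    // First pass: unwrap the outer JSON string literal.
    String unescaped = mapper.readValue(raw, String.class);
    // Second pass: parse the unwrapped text as the actual JSON document.
    return mapper.readTree(unescaped);
}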
All of this points to weaknesses in the Apache Beam KafkaIO library. The examples provided for its use are inaccurate and don't work for the simple default case. On top of that, it's so new that there are few examples of its use online, so finding a solution when you hit a problem can be challenging.
I really should submit a pull request with better examples.