I am trying to run an Apache Beam application on a Flink cluster, but it fails while translating the Kafka UnboundedSource with the error:

[partitions type:ARRAY pos:0] is not serializable

The application is a word-count example that reads from a Kafka topic and publishes to a Kafka topic, and it works fine with Beam's direct runner.

I created the pom.xml by following Beam's Java Quickstart and then added the KafkaIO SDK. I am running a single-node local Flink 1.8.1 cluster and Kafka 2.3.0.
pom.xml snippet:
<properties>
  <beam.version>2.14.0</beam.version>
  <flink.artifact.name>beam-runners-flink-1.8</flink.artifact.name>
  <flink.version>1.8.1</flink.version>
</properties>
...
<profile>
  <id>flink-runner</id>
  <!-- Makes the FlinkRunner available when running a pipeline. -->
  <dependencies>
    <dependency>
      <groupId>org.apache.beam</groupId>
      <!-- Please see the Flink Runner page for an up-to-date list
           of supported Flink versions and their artifact names:
           https://beam.apache.org/documentation/runners/flink/ -->
      <artifactId>${flink.artifact.name}</artifactId>
      <version>${beam.version}</version>
      <scope>runtime</scope>
    </dependency>
    <!-- Tried with and without this flink-avro dependency -->
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-avro</artifactId>
      <version>${flink.version}</version>
    </dependency>
  </dependencies>
</profile>
...
<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-sdks-java-io-kafka</artifactId>
  <version>${beam.version}</version>
</dependency>
<dependency>
  <groupId>org.apache.kafka</groupId>
  <artifactId>kafka-clients</artifactId>
  <version>2.3.0</version>
</dependency>
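For context, the flink-runner profile above only takes effect when it is activated at build time. Assuming the Quickstart's shade-plugin setup (which is what produces the bundled jar submitted below), the packaging step would look like:

mvn clean package -Pflink-runner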
KafkaWordCount.java snippet:

// Create the Pipeline object with the options we defined above.
Pipeline p = Pipeline.create(options);

PCollection<KV<String, Long>> counts = p.apply(KafkaIO.<String, String>read()
    .withBootstrapServers(options.getBootstrapServer())
    .withTopics(Collections.singletonList(options.getInputTopic()))
    .withKeyDeserializer(StringDeserializer.class)
    .withValueDeserializer(StringDeserializer.class)
    .updateConsumerProperties(ImmutableMap.of("auto.offset.reset", (Object) "latest"))
    .withoutMetadata() // PCollection<KV<String, String>> instead of KafkaRecord type
)
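For reference, here is a hedged sketch of the custom options interface implied by the options.getBootstrapServer() and options.getInputTopic() calls above; the real KafkaWordCount may declare different names or defaults:

import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;

// Hypothetical options interface, reconstructed from the calls in the
// snippet above. Beam derives the --bootstrapServer and --inputTopic
// command-line flags from these getter names.
public interface KafkaWordCountOptions extends PipelineOptions {

    @Description("Kafka bootstrap server, e.g. localhost:9092")
    @Default.String("localhost:9092")
    String getBootstrapServer();
    void setBootstrapServer(String value);

    @Description("Kafka topic to read from")
    @Default.String("input-topic")
    String getInputTopic();
    void setInputTopic(String value);
}

The options object would then be created with PipelineOptionsFactory.fromArgs(args).withValidation().as(KafkaWordCountOptions.class), which is what lets the --bootstrapServer=localhost:9092 flag on the submit command below reach the pipeline.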
The error above is produced when submitting the Beam jar to Flink with:
/opt/flink/bin/flink run -c org.apache.beam.examples.KafkaWordCount target/word-count-beam-bundled-0.1.jar --runner=FlinkRunner --bootstrapServer=localhost:9092
Update
It turns out there is a known issue in Beam related to running on Flink that appears to be the cause: https://issues.apache.org/jira/browse/BEAM-7478. One of the comments on it specifically mentions that flink run cannot be used with KafkaIO because Avro's Schema.Field is not serializable: https://issues.apache.org/jira/browse/BEAM-7478?focusedCommentId=16902419&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16902419
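The symptom matches how Flink distributes work: the Kafka UnboundedSource has to be Java-serialized so it can be shipped to the cluster, and per the linked comment Avro's Schema.Field sits in its object graph. A minimal, hypothetical check (not part of Beam or Flink) that reproduces the same failure mode for any object:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;

// Hypothetical helper: Java-serializes an object the same way the runner
// must. If anything reachable from the object does not implement
// java.io.Serializable, writeObject throws NotSerializableException.
public final class SerializabilityCheck {

    static boolean isJavaSerializable(Object candidate) {
        try (ObjectOutputStream out =
                     new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(candidate); // walks the entire object graph
            return true;
        } catch (NotSerializableException e) {
            System.err.println("not serializable: " + e.getMessage());
            return false;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isJavaSerializable("a plain string")); // true
    }
}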
Update 2
As mentioned in the comments, one workaround is to downgrade Flink to 1.8.0.
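In the pom.xml above, that amounts to changing the Flink version property (the local cluster would presumably need to match); the runner artifact name can stay the same, since beam-runners-flink-1.8 targets the Flink 1.8.x line:

<properties>
  <beam.version>2.14.0</beam.version>
  <flink.artifact.name>beam-runners-flink-1.8</flink.artifact.name>
  <flink.version>1.8.0</flink.version> <!-- downgraded from 1.8.1 -->
</properties>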