I am trying to consume Kafka messages in Dataflow using Apache Beam. I wrote a simple pipeline with Apache Beam version 2.1.0:
public static void main(String[] args) {
    DrainOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(DrainOptions.class);
    options.setStreaming(true);

    Pipeline p = Pipeline.create(options);

    // Kafka consumer configuration passed through to KafkaIO.
    Map<String, Object> props = new HashMap<>();
    props.put("auto.offset.reset", "latest");
    props.put("group.id", "test-group");

    p.apply(KafkaIO.readBytes()
            .updateConsumerProperties(props)
            .withTopic(options.getTopic())
            .withBootstrapServers(options.getBootstrapServer()))
        .apply(ParDo.of(new GetValue()))
        .apply("ToString", ParDo.of(new ToString()))
        .apply("FixedWindow", Window.<String>into(FixedWindows.of(Duration.standardSeconds(30))))
        // Write one shard per window to the configured output location.
        .apply(TextIO.write().to(options.getOutput()).withWindowedWrites().withNumShards(1));

    PipelineResult pipelineResult = p.run();
    pipelineResult.waitUntilFinish();
}
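The GetValue and ToString transforms above are custom DoFns that I have not shown. A minimal sketch of what they might look like, assuming GetValue extracts the raw value bytes from each Kafka record and ToString decodes them as UTF-8 (these exact implementations are an assumption, not the original code):

import java.nio.charset.StandardCharsets;
import org.apache.beam.sdk.io.kafka.KafkaRecord;
import org.apache.beam.sdk.transforms.DoFn;

// Assumed: extracts the raw value bytes from each KafkaRecord.
static class GetValue extends DoFn<KafkaRecord<byte[], byte[]>, byte[]> {
    @ProcessElement
    public void processElement(ProcessContext c) {
        c.output(c.element().getKV().getValue());
    }
}

// Assumed: decodes the value bytes as a UTF-8 string.
static class ToString extends DoFn<byte[], String> {
    @ProcessElement
    public void processElement(ProcessContext c) {
        c.output(new String(c.element(), StandardCharsets.UTF_8));
    }
}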
When I try to run it with the Dataflow runner:
mvn compile exec:java -Dexec.mainClass=com.test.beamexample.Drain -Dexec.args="--project=my-project --gcpTempLocation=gs://my_bucket/tmp/drain --streaming=true --stagingLocation=gs://my_bucket/staging/drain --output=gs://my_bucket/output/staging/drainresult --bootstrapServer=kafka-broker:9092 --topic=test --runner=DataflowRunner" -Pdataflow-runner
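The DrainOptions interface referenced in the pipeline is also not shown. A minimal sketch, assuming it only declares the custom --topic, --bootstrapServer, and --output flags on top of the standard Dataflow options (DataflowPipelineOptions already includes StreamingOptions, so setStreaming(true) is available):

import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.options.Description;

// Hypothetical sketch; the real interface is not shown in this post.
public interface DrainOptions extends DataflowPipelineOptions {
    @Description("Kafka topic to read from")
    String getTopic();
    void setTopic(String value);

    @Description("Kafka bootstrap server (host:port)")
    String getBootstrapServer();
    void setBootstrapServer(String value);

    @Description("Output location for TextIO.write()")
    String getOutput();
    void setOutput(String value);
}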
The pipeline builds successfully and the files are uploaded to the staging location, but before the Dataflow runner launches the pipeline it starts executing locally, without creating a Dataflow job, just as if we were using the direct runner:
Nov 14, 2017 2:14:52 PM org.apache.beam.runners.dataflow.DataflowRunner fromOptions
INFO: PipelineOptions.filesToStage was not specified. Defaulting to files from the classpath: will stage 111 files. Enable logging at DEBUG level to see which files will be staged.
Nov 14, 2017 2:14:53 PM org.apache.beam.runners.dataflow.DataflowRunner run
INFO: Executing pipeline on the Dataflow Service, which will have billing implications related to Google Compute Engine usage and other Google Cloud Services.
Nov 14, 2017 2:14:53 PM org.apache.beam.runners.dataflow.util.PackageUtil stageClasspathElements
INFO: Uploading 111 files from PipelineOptions.filesToStage to staging location to prepare for execution.
Nov 14, 2017 2:14:59 PM org.apache.beam.runners.dataflow.util.PackageUtil stageClasspathElements
INFO: Staging files complete: 111 files cached, 0 files newly uploaded
Nov 14, 2017 2:15:00 PM org.apache.beam.runners.dataflow.DataflowPipelineTranslator$Translator addStep
INFO: Adding KafkaIO.Read/Read(UnboundedKafkaSource)/DataflowRunner.StreamingUnboundedRead.ReadWithIds as step s1
Nov 14, 2017 2:15:00 PM org.apache.kafka.common.config.AbstractConfig logAll
INFO: ConsumerConfig values:
auto.commit.interval.ms = 5000
auto.offset.reset = latest
...
Is there anything I'm missing?