Cannot run an apache-beam pipeline with KafkaIO on Dataflow

Asked: 2017-11-14 10:42:32

Tags: apache-kafka google-cloud-dataflow apache-beam

I am trying to consume Kafka messages with Apache Beam on Dataflow. I wrote a simple pipeline using Apache Beam 2.1.0:

public static void main(String[] args) {
    DrainOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(DrainOptions.class);
    options.setStreaming(true);

    Pipeline p = Pipeline.create(options);

    // Start from the latest offsets under a fixed consumer group.
    Map<String, Object> props = new HashMap<>();
    props.put("auto.offset.reset", "latest");
    props.put("group.id", "test-group");

    // Read raw bytes from Kafka, decode them to strings, window into
    // 30-second fixed windows, and write one shard per window.
    p.apply(KafkaIO.readBytes()
            .updateConsumerProperties(props)
            .withTopic(options.getTopic())
            .withBootstrapServers(options.getBootstrapServer())
    ).apply(ParDo.of(new GetValue()))
            .apply("ToString", ParDo.of(new ToString()))
            .apply("FixedWindow", Window.<String>into(FixedWindows.of(Duration.standardSeconds(30))))
            .apply(TextIO.write().to(options.getOutput()).withWindowedWrites().withNumShards(1));

    PipelineResult pipelineResult = p.run();
    pipelineResult.waitUntilFinish();
}
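
The snippet references a DrainOptions interface and two transforms, GetValue and ToString, that are not shown. Only their names and the --topic/--bootstrapServer/--output flags appear above; the following is a minimal sketch of what they presumably look like (the @Description texts and method bodies are assumptions):

import java.nio.charset.StandardCharsets;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.io.kafka.KafkaRecord;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.DoFn.ProcessElement;

// Options interface backing the --topic, --bootstrapServer and --output flags.
public interface DrainOptions extends DataflowPipelineOptions {
    @Description("Kafka topic to read from")
    String getTopic();
    void setTopic(String value);

    @Description("Kafka bootstrap server, as host:port")
    String getBootstrapServer();
    void setBootstrapServer(String value);

    @Description("Output path prefix for TextIO.write()")
    String getOutput();
    void setOutput(String value);
}

// Extracts the value bytes from each KafkaRecord.
class GetValue extends DoFn<KafkaRecord<byte[], byte[]>, byte[]> {
    @ProcessElement
    public void processElement(ProcessContext c) {
        c.output(c.element().getKV().getValue());
    }
}

// Decodes the value bytes as UTF-8.
class ToString extends DoFn<byte[], String> {
    @ProcessElement
    public void processElement(ProcessContext c) {
        c.output(new String(c.element(), StandardCharsets.UTF_8));
    }
}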

When I try to run it with the Dataflow runner:

mvn compile exec:java -Dexec.mainClass=com.test.beamexample.Drain -Dexec.args="--project=my-project --gcpTempLocation=gs://my_bucket/tmp/drain --streaming=true --stagingLocation=gs://my_bucket/staging/drain --output=gs://my_bucket/output/staging/drainresult --bootstrapServer=kafka-broker:9092 --topic=test --runner=DataflowRunner" -Pdataflow-runner
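
The -Pdataflow-runner flag activates a Maven profile that pulls in the Dataflow runner at runtime. My actual pom is not shown; assuming the layout of the Beam quickstart archetype, it looks roughly like this:

<!-- dataflow-runner profile assumed by -Pdataflow-runner; this mirrors the
     Beam quickstart archetype and is a sketch, not the actual pom.xml. -->
<profile>
  <id>dataflow-runner</id>
  <dependencies>
    <dependency>
      <groupId>org.apache.beam</groupId>
      <artifactId>beam-runners-google-cloud-dataflow-java</artifactId>
      <version>2.1.0</version>
      <scope>runtime</scope>
    </dependency>
  </dependencies>
</profile>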

The pipeline builds successfully and the files are uploaded to the staging location, but before the Dataflow runner gets to run the pipeline, it executes locally and no Dataflow job is created, just like when we use the direct runner:

Nov 14, 2017 2:14:52 PM org.apache.beam.runners.dataflow.DataflowRunner fromOptions
INFO: PipelineOptions.filesToStage was not specified. Defaulting to files from the classpath: will stage 111 files. Enable logging at DEBUG level to see which files will be staged.
Nov 14, 2017 2:14:53 PM org.apache.beam.runners.dataflow.DataflowRunner run
INFO: Executing pipeline on the Dataflow Service, which will have billing implications related to Google Compute Engine usage and other Google Cloud Services.
Nov 14, 2017 2:14:53 PM org.apache.beam.runners.dataflow.util.PackageUtil stageClasspathElements
INFO: Uploading 111 files from PipelineOptions.filesToStage to staging location to prepare for execution.
Nov 14, 2017 2:14:59 PM org.apache.beam.runners.dataflow.util.PackageUtil stageClasspathElements
INFO: Staging files complete: 111 files cached, 0 files newly uploaded
Nov 14, 2017 2:15:00 PM org.apache.beam.runners.dataflow.DataflowPipelineTranslator$Translator addStep
INFO: Adding KafkaIO.Read/Read(UnboundedKafkaSource)/DataflowRunner.StreamingUnboundedRead.ReadWithIds as step s1
Nov 14, 2017 2:15:00 PM org.apache.kafka.common.config.AbstractConfig logAll
INFO: ConsumerConfig values: 
        auto.commit.interval.ms = 5000
        auto.offset.reset = latest
...

Am I missing something?

0 Answers:

No answers yet