Question

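Hello, I have created an Apache Beam pipeline, tested it, and run it from inside Eclipse, both locally and using the Dataflow runner. I can see in the Eclipse console that the pipeline is running, and I also see the details, i.e. the logs on the console.

Now, how do I deploy this pipeline to GCP so that it keeps working regardless of the state of my machine? For example, if I run it using mvn compile exec:java, the console shows that it is running, but I cannot find the job in the Dataflow UI.

Also, what happens if I kill the process locally? Will the job on the GCP infrastructure stop as well? How do I know that a job has been triggered on the GCP infrastructure independently of the state of my machine?

The output of mvn compile exec:java with arguments is as follows: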

    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/C:/Users/ThakurG/.m2/repository/org/slf4j/slf4j-jdk14/1.7.14/slf4j-jdk14-1.7.14.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/C:/Users/ThakurG/.m2/repository/org/slf4j/slf4j-nop/1.7.25/slf4j-nop-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.slf4j.impl.JDK14LoggerFactory]
    Jan 08, 2018 5:33:22 PM com.trial.apps.gcp.df.ReceiveAndPersistToBQ main
    INFO: starting the process...
    Jan 08, 2018 5:33:25 PM com.trial.apps.gcp.df.ReceiveAndPersistToBQ createStream
    INFO: pipeline created::Pipeline#73387971
    Jan 08, 2018 5:33:27 PM com.trial.apps.gcp.df.ReceiveAndPersistToBQ main
    INFO: pie crated::Pipeline#73387971
    Jan 08, 2018 5:54:57 PM com.trial.apps.gcp.df.ReceiveAndPersistToBQ$1 apply
    INFO: Message received::1884408,16/09/2017,A,2007156,CLARK RUBBER FRANCHISING PTY LTD,A ,5075,6,Y,296,40467910,-34.868095,138.683535,66 SILKES RD,,,PARADISE,5075,0,7.4,5.6,18/09/2017 2:09,0.22
    Jan 08, 2018 5:54:57 PM com.trial.apps.gcp.df.ReceiveAndPersistToBQ$1 apply
    INFO: Payload from msg::1884408,16/09/2017,A,2007156,CLARK RUBBER FRANCHISING PTY LTD,A ,5075,6,Y,296,40467910,-34.868095,138.683535,66 SILKES RD,,,PARADISE,5075,0,7.4,5.6,18/09/2017 2:09,0.22
    Jan 08, 2018 5:54:57 PM com.trial.apps.gcp.df.ReceiveAndPersistToBQ$1 apply

This is the Maven command I am using at the cmd prompt:

`mvn compile exec:java -Dexec.mainClass=com.trial.apps.gcp.df.ReceiveAndPersistToBQ -Dexec.args="--project=analyticspoc-XXX --stagingLocation=gs://analytics_poc_staging --runner=DataflowRunner --streaming=true"`

This is the code snippet I use to create the pipeline and set the options on it.

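    import org.apache.beam.runners.dataflow.DataflowRunner;
    import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    // Start from default options and view them as Dataflow-specific options.
    PipelineOptions options = PipelineOptionsFactory.create();

    DataflowPipelineOptions dfOptions = options.as(DataflowPipelineOptions.class);
    dfOptions.setRunner(DataflowRunner.class);
    dfOptions.setJobName("gcpgteclipse");
    dfOptions.setStreaming(true);

    // Then create the pipeline.
    Pipeline pipeL = Pipeline.create(dfOptions);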

Answer 1

Can you clarify what you mean by "the console shows it is running" and "cannot find the job using the Dataflow UI"?

If the output of your program prints the message:

To access the Dataflow monitoring console, please navigate to https://console.developers.google.com/project/.../dataflow/job/....

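then your job is running on the Dataflow service. Once it is running, killing the main program will not stop the job; all the main program does is periodically poll the Dataflow service for the job status and new log messages. Following the printed link will take you to the Dataflow UI.

If this message is not printed, then your program is probably getting stuck somewhere before actually starting the Dataflow job. If you include your program's output, that will help with debugging.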
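As a rough illustration of the check described above, here is a minimal sketch in plain Java (no Beam dependency) that scans captured launcher output for the monitoring-console message. The class name `MonitoringLinkFinder`, the method `findMonitoringUrl`, and the sample URL are hypothetical, and the exact wording of the message may differ between SDK versions, so treat the pattern as an assumption.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MonitoringLinkFinder {
    // Message prefix as quoted in the answer; the wording is assumed and
    // may vary between SDK versions.
    private static final Pattern LINK = Pattern.compile(
        "To access the Dataflow monitoring console, please navigate to (\\S+)");

    /** Returns the monitoring URL if the captured output contains the message. */
    public static Optional<String> findMonitoringUrl(String launcherOutput) {
        Matcher m = LINK.matcher(launcherOutput);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical launcher output; the URL is a placeholder, not a real job.
        String output = String.join("\n",
            "INFO: starting the process...",
            "INFO: To access the Dataflow monitoring console, please navigate to "
                + "https://example.invalid/dataflow/job/hypothetical-id");
        System.out.println(
            findMonitoringUrl(output).orElse("no Dataflow job link found"));
    }
}
```

If the helper returns an empty result for your real output, that matches the second case above: the program probably never reached the point of submitting the job.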

Answer 2

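To deploy a pipeline to be executed by Dataflow, you specify the runner and project execution parameters, either on the command line or through the DataflowPipelineOptions class. runner must be set to DataflowRunner (Apache Beam 2.x.x) and project to your GCP project ID. See Specifying Execution Parameters. If you do not see the job in the Dataflow jobs list UI, then it is definitely not running in Dataflow.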
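To make the two required parameters concrete, here is a small pre-flight check in plain Java that inspects `--name=value` launch arguments before they are handed to the pipeline. It is only a sketch: `LaunchArgsCheck`, `parse`, and `problems` are hypothetical names, not part of Beam, and the check only looks at the argument strings, not at options set in code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LaunchArgsCheck {
    /** Parses "--name=value" arguments into a map; anything else is ignored. */
    static Map<String, String> parse(String[] args) {
        Map<String, String> options = new LinkedHashMap<>();
        for (String arg : args) {
            int eq = arg.indexOf('=');
            if (arg.startsWith("--") && eq > 2) {
                options.put(arg.substring(2, eq), arg.substring(eq + 1));
            }
        }
        return options;
    }

    /** Lists what is missing for a Dataflow launch; empty means both are present. */
    static List<String> problems(String[] args) {
        Map<String, String> options = parse(args);
        List<String> problems = new ArrayList<>();
        if (!"DataflowRunner".equals(options.get("runner"))) {
            problems.add("runner must be DataflowRunner");
        }
        String project = options.get("project");
        if (project == null || project.isEmpty()) {
            problems.add("project must be set to a GCP project ID");
        }
        return problems;
    }

    public static void main(String[] args) {
        // Arguments taken from the Maven command in the question.
        String[] launchArgs = {
            "--project=analyticspoc-XXX",
            "--stagingLocation=gs://analytics_poc_staging",
            "--runner=DataflowRunner",
            "--streaming=true"
        };
        System.out.println(problems(launchArgs));
    }
}
```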

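If you terminate the process that deployed the job to Dataflow, the job will continue to run in Dataflow. It will not be stopped.

This is trivial, but to be absolutely clear: you must call run() on the Pipeline object for it to be executed (and therefore deployed to Dataflow). The return value of run() is a PipelineResult object, which has various methods for determining the status of the job. For example, you can call pipeline.run().waitUntilFinish(); to force your program to block until the job finishes. If your program is blocking, then you know the job was triggered. For all available methods, see the PipelineResult section of the Apache Beam Java SDK documentation.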

Apache Beam deployment on Dataflow
