Dataflow SparkPipelineRunner - any available examples?

Asked: 2015-10-06 08:14:24

Tags: apache-spark google-cloud-platform google-cloud-dataflow

Does anybody have a working example of using the Cloudera SparkPipelineRunner to execute, on a cluster, a pipeline written using the Dataflow SDK?

I can't see any in the Dataflow or Spark-Dataflow GitHub repos.

We're trying to evaluate whether running our pipelines on a Spark cluster will give us any performance gains over running them on the GCP Dataflow service.

1 answer:

Answer 0 (score: 2)

The Beam site has some examples of using the Beam Spark Runner: https://beam.apache.org/documentation/runners/spark/

The dependency you want is:

<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-runners-spark</artifactId>
  <version>0.3.0-incubating</version>
</dependency>

To run against a standalone cluster, simply run:

spark-submit --class com.beam.examples.BeamPipeline --master spark://HOST:PORT target/beam-examples-1.0.0-shaded.jar --runner=SparkRunner
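
The spark-submit command expects a shaded (fat) jar that bundles the pipeline code together with the Beam Spark runner. As a rough sketch, a maven-shade-plugin configuration along the lines of the one shown on the Beam Spark runner page could produce it. The `shaded` classifier here is an assumption chosen to match the jar name in the command, and the plugin version is left out on the assumption that it is managed elsewhere in your POM:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <createDependencyReducedPom>false</createDependencyReducedPom>
    <filters>
      <filter>
        <artifact>*:*</artifact>
        <excludes>
          <!-- Drop signature files from dependencies; stale signatures
               make the merged jar fail verification at load time. -->
          <exclude>META-INF/*.SF</exclude>
          <exclude>META-INF/*.DSA</exclude>
          <exclude>META-INF/*.RSA</exclude>
        </excludes>
      </filter>
    </filters>
  </configuration>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <!-- Attach the fat jar as <artifactId>-<version>-shaded.jar
             (classifier name is an assumption, see above). -->
        <shadedArtifactAttached>true</shadedArtifactAttached>
        <shadedClassifierName>shaded</shadedClassifierName>
        <transformers>
          <!-- Merge META-INF/services entries so runner and filesystem
               registrars from different jars are all discoverable. -->
          <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>
```

With something like this in place, `mvn package` should leave the shaded jar under `target/`, which is the path passed to spark-submit.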