How do I activate the Dataflow Shuffle service?

Time: 2017-11-30 11:28:54

Tags: python google-cloud-dataflow

I am trying to use the Dataflow Shuffle service in a Python environment, but the shuffle service does not seem to be working, as shown below.

[Screenshot: console output]

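I set the SDK version to 2.1 or later and the region to us-central1. I thought the Dataflow Shuffle service could be activated by adding the experiments option. What am I missing?

Below is the code I tested; you can use it to reproduce the behavior.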

import apache_beam as beam

options = beam.options.pipeline_options.PipelineOptions()
gcloud_options = options.view_as(
    beam.options.pipeline_options.GoogleCloudOptions)
gcloud_options.job_name = 'dataflow-shuffle-test'
gcloud_options.project = 'PROJECTID'
gcloud_options.staging_location = 'gs://BUCKET/staging'
gcloud_options.temp_location = 'gs://BUCKET/temp'

# maybe this is the wrong way?
debug_options = options.view_as(beam.options.pipeline_options.DebugOptions)
debug_options.experiments = 'shuffle_mode=service'

worker_options = options.view_as(beam.options.pipeline_options.WorkerOptions)
worker_options.disk_size_gb = 20
worker_options.max_num_workers = 2

options.view_as(beam.options.pipeline_options.StandardOptions).runner = 'DataflowRunner'

def modify_data2(kvpair):
    return {'name': kvpair[0],
            'sum': sum(kvpair[1])
            }


p = beam.Pipeline(options=options)

query = 'SELECT * FROM [bigquery-public-data:usa_names.usa_1910_current]'
(p | 'read' >> beam.io.Read(beam.io.BigQuerySource(project='PROJECTID', 
                                                   use_standard_sql=False, 
                                                   query=query))
   | 'pair' >> beam.Map(lambda x: (x['name'], x['number']))
   | "groupby" >> beam.GroupByKey()
   | 'modify' >> beam.Map(modify_data2)
   | 'write' >> beam.io.WriteToText('gs://BUCKET/test.txt', num_shards=1)
 )

p.run()

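The job completes successfully without any errors. Any comments would be helpful!

Edit
Thanks to Sergei's answer, I found my mistake. I had misunderstood the experiments option: it has to be set as a list rather than a string, as follows.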

# set as list, instead of string.
debug_options.experiments = ['shuffle_mode=service']
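To illustrate why the list form matters, here is a minimal sketch using only the standard library. It assumes the `--experiments` flag is collected as a repeatable, appending option; the `parser` below is a hypothetical stand-in, not Beam's actual parser. Under that assumption the parsed value is a list of strings, whereas a bare string assigned in its place is walked character by character by any code that iterates over it.

```python
import argparse

# Hypothetical stand-in parser: assumes --experiments is an appending flag.
parser = argparse.ArgumentParser()
parser.add_argument('--experiments', dest='experiments',
                    action='append', default=None)

parsed = parser.parse_args(['--experiments', 'shuffle_mode=service'])
as_list = parsed.experiments          # list with a single entry
as_string = 'shuffle_mode=service'    # what the original code assigned


def entries(experiments):
    """Return the individual entries seen by code that iterates the value."""
    return [e for e in experiments]


print(entries(as_list))         # one entry: the whole experiment string
print(entries(as_string)[:3])   # single characters, not an experiment name
```

With the string form, no single entry equals `shuffle_mode=service`, which is consistent with the experiment being silently ignored rather than raising an error.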

I also wanted to run a simple pipeline with the shuffle service. Here is a notebook that can be run on Datalab: https://gist.github.com/hayatoy/f6664f965a2519ec406e11235faf75b6

1 answer:

Answer 0: (score: 1)

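@HayatoY, specifying the experiments flag (--experiments shuffle_mode=service) should be all that is needed.

The Dataflow Shuffle service is available with the Python SDK starting from version 2.1, in the us-central1 and europe-west1 regions.

Could you check the experiments row under the "Pipeline options" pane on the "Job details" page in the UI? (See my screenshot.)

I just launched a simple wordcount pipeline from the command line and verified that it used Shuffle. The metric shows 0, but that is normal, because the wordcount pipeline does very little shuffling. As long as the metric is not "-", the shuffle service was used.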

python -m apache_beam.examples.wordcount \
  --project $PROJECT_ID \
  --runner DataflowRunner \
  --staging_location $BUCKET/staging \
  --temp_location $BUCKET/temp \
  --output $BUCKET/output \
  --experiments shuffle_mode=service
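The same flags can also be assembled in Python. This is only a sketch: `wordcount_args` is a hypothetical helper, and `my-project` and `gs://my-bucket` are placeholder values. The resulting list is the kind of argument vector that could be handed to `PipelineOptions(flags=...)` or to a subprocess call.

```python
def wordcount_args(project_id, bucket):
    """Build the flag list matching the command above (hypothetical helper)."""
    return [
        '--project', project_id,
        '--runner', 'DataflowRunner',
        '--staging_location', bucket + '/staging',
        '--temp_location', bucket + '/temp',
        '--output', bucket + '/output',
        # The experiment is passed as its own --experiments entry.
        '--experiments', 'shuffle_mode=service',
    ]


# Placeholder project and bucket names, for illustration only.
args = wordcount_args('my-project', 'gs://my-bucket')
print(' '.join(args))
```

Keeping the experiment as a separate list element avoids the string-versus-list confusion described in the question.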

[Screenshot: a Python pipeline with shuffle]