Running an Apache Beam pipeline on Dataflow throws an error (DirectRunner runs without issues)

Date: 2018-08-22 15:44:42

Tags: python google-cloud-dataflow apache-beam

A pipeline that runs perfectly fine otherwise throws an error when run on Dataflow. So I tried a simple pipeline and got the same error.

The same pipeline runs without problems on the DirectRunner. The execution environment is a Google Datalab.

Could you let me know if any changes/updates are needed in my environment, or suggest anything else I should try?

Many thanks, e


The following simple pipeline throws the error when run on Dataflow:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import GoogleCloudOptions
from apache_beam.options.pipeline_options import StandardOptions

BUCKET_URL = 'gs://archs4'  # bucket used for staging/temp, per the comments below

options = PipelineOptions()
google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = 'PROJECT-ID'
google_cloud_options.job_name = 'try-debug'
google_cloud_options.staging_location = '%s/staging' % BUCKET_URL #'gs://archs4/staging'
google_cloud_options.temp_location = '%s/tmp' % BUCKET_URL #'gs://archs4/temp'
options.view_as(StandardOptions).runner = 'DataflowRunner'

p1 = beam.Pipeline(options=options)

(p1 | 'read' >> beam.io.ReadFromText('gs://dataflow-samples/shakespeare/kinglear.txt')
    | 'write' >> beam.io.WriteToText('gs://bucket/test.txt', num_shards=1)
 )

p1.run().wait_until_finish()
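One thing worth ruling out when a job only fails on Dataflow is a malformed staging or temp location. The snippet below is a small standalone sanity check with no Beam dependency; `looks_like_gcs_path` is an illustrative helper (not a Beam API), and `gs://archs4` is taken from the comments in the code above:

```python
def looks_like_gcs_path(path):
    """Rough check that a string is a GCS URI of the form gs://bucket/key."""
    if not path.startswith("gs://"):
        return False
    remainder = path[len("gs://"):]
    # The bucket name (everything before the first slash) must be non-empty.
    return len(remainder.split("/", 1)[0]) > 0

BUCKET_URL = "gs://archs4"  # from the comments in the question
print(looks_like_gcs_path("%s/staging" % BUCKET_URL))  # True
print(looks_like_gcs_path("archs4/staging"))           # False: missing gs:// scheme
```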

1 Answer:

Answer 0 (score: 2)

I was able to run your code without any problems with the DataflowRunner from a Jupyter notebook (not Datalab itself).

At the time of writing, I'm using the latest version of the apache_beam[gcp] Python SDK (v2.6.0). Could you retry with v2.6.0 instead of v2.0.0?
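As a quick sanity check before resubmitting, you can compare the installed SDK version (available at runtime as `apache_beam.__version__`) against the suggested minimum. The helper below is an illustrative sketch that avoids importing Beam itself; `is_at_least` is a hypothetical name, not part of the Beam API:

```python
def is_at_least(installed, minimum):
    """Compare dotted version strings numerically, e.g. '2.6.0' >= '2.0.0'.

    Naive string comparison would get '2.10.0' < '2.6.0' wrong, so split
    each version into a tuple of integers first.
    """
    to_tuple = lambda v: tuple(int(part) for part in v.split("."))
    return to_tuple(installed) >= to_tuple(minimum)

# Versions taken from the answer above.
print(is_at_least("2.6.0", "2.0.0"))  # True
print(is_at_least("2.0.0", "2.6.0"))  # False
```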

Here is what I ran:

import apache_beam as beam
from apache_beam.pipeline import PipelineOptions
from apache_beam.options.pipeline_options import GoogleCloudOptions
from apache_beam.options.pipeline_options import StandardOptions

BUCKET_URL = "gs://YOUR_BUCKET_HERE/test"

import os
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'PATH_TO_YOUR_SERVICE_ACCOUNT_JSON_CREDS'

options = PipelineOptions()
google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = 'YOUR_PROJECT_ID_HERE'
google_cloud_options.job_name = 'try-debug'
google_cloud_options.staging_location = '%s/staging' % BUCKET_URL #'gs://archs4/staging'
google_cloud_options.temp_location = '%s/tmp' % BUCKET_URL #'gs://archs4/temp'
options.view_as(StandardOptions).runner = 'DataflowRunner'

p1 = beam.Pipeline(options=options)

(p1 | 'read' >> beam.io.ReadFromText('gs://dataflow-samples/shakespeare/kinglear.txt')
    | 'write' >> beam.io.WriteToText('gs://bucket/test.txt', num_shards=1)
 )

p1.run().wait_until_finish()

Here is proof that it ran: [screenshot of the Dataflow job]

The job failed, as expected, because I don't have write access to gs://bucket/test.txt; you can also see this in the stacktrace at the bottom left of the screenshot. However, the job was successfully submitted to Google Cloud Dataflow, and it ran.