BigQuery Dataflow error: cannot read and write in different locations when reading and writing in the EU

Date: 2017-10-15 10:11:06

Tags: python google-bigquery google-cloud-dataflow apache-beam

I have a simple Google Dataflow job. It reads from one BigQuery table and writes into another, like this:

(p
 | beam.io.Read(beam.io.BigQuerySource(
       query='select dia, import from DS1.t_27k where true',
       use_standard_sql=True))
 | beam.io.Write(beam.io.BigQuerySink(
       output_table,
       dataset='DS1',
       project=project,
       schema='dia:DATE, import:FLOAT',
       create_disposition=CREATE_IF_NEEDED,
       write_disposition=WRITE_TRUNCATE)))

I think the problem is that this pipeline seems to need a temporary dataset to do its work, and I cannot force the location of that temporary dataset. Since my DS1 is in the EU (europe-west1) and the temporary dataset ends up in the US (I guess), the job fails:

WARNING:root:Dataset m-h-0000:temp_dataset_e433a0ef19e64100000000000001a does not exist so we will create it as temporary with location=None
WARNING:root:A task failed with exception.
 HttpError accessing <https://www.googleapis.com/bigquery/v2/projects/m-h-000000/queries/b8b2f00000000000000002bed336369d?alt=json&maxResults=10000>: response: <{'status': '400', 'content-length': '292', 'x-xss-protection': '1; mode=block', 'x-content-type-options': 'nosniff', 'transfer-encoding': 'chunked', 'expires': 'Sat, 14 Oct 2017 20:29:15 GMT', 'vary': 'Origin, X-Origin', 'server': 'GSE', '-content-encoding': 'gzip', 'cache-control': 'private, max-age=0', 'date': 'Sat, 14 Oct 2017 20:29:15 GMT', 'x-frame-options': 'SAMEORIGIN', 'alt-svc': 'quic=":443"; ma=2592000; v="39,38,37,35"', 'content-type': 'application/json; charset=UTF-8'}>, content <{
 "error": {
  "errors": [
   {
    "domain": "global",
    "reason": "invalid",
    "message": "Cannot read and write in different locations: source: EU, destination: US"
   }
  ],
  "code": 400,
  "message": "Cannot read and write in different locations: source: EU, destination: US"
 }
}

Pipeline options:

options = PipelineOptions()

google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = 'm-h'
google_cloud_options.job_name = 'myjob3'
google_cloud_options.staging_location = r'gs://p_df/staging'  #EUROPE-WEST1
google_cloud_options.region=r'europe-west1'
google_cloud_options.temp_location = r'gs://p_df/temp' #EUROPE-WEST1
options.view_as(StandardOptions).runner = 'DirectRunner'  # 'DataflowRunner'

p = beam.Pipeline(options=options)

How can I avoid this error?

Note: the error only appears when I run the pipeline with DirectRunner.

2 Answers:

Answer 0 (score: 3)

The error Cannot read and write in different locations is fairly self-explanatory. It can happen because:

  • your BigQuery dataset is in the EU and you run Dataflow in the US, or
  • your GCS bucket is in the EU and you run Dataflow in the US.

As you point out in the question, your temp location in GCS is in the EU and your BigQuery dataset is also in the EU, so you must run the Dataflow job in the EU as well.
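As a quick sanity check before launching a job, you can compare the two location strings up front. This is a minimal illustrative sketch, not part of any Google API; the helper name and the error text mirroring BigQuery's message are my own:

```python
def check_same_location(source_location, destination_location):
    """Illustrative helper: BigQuery rejects a query job whose source
    and destination datasets are in different locations, so compare
    the location strings (e.g. 'EU' vs. 'US') before running."""
    if source_location.upper() != destination_location.upper():
        raise ValueError(
            'Cannot read and write in different locations: '
            'source: %s, destination: %s'
            % (source_location, destination_location))

check_same_location('EU', 'EU')  # same location: no error raised
```

With an EU source and a US destination, the helper raises the same kind of complaint you see in the job log above.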

To achieve that, you need to specify the zone parameter in your PipelineOptions, like this:

import apache_beam as beam
from apache_beam.options.pipeline_options import (
    PipelineOptions, GoogleCloudOptions, StandardOptions, WorkerOptions)

options = PipelineOptions()

wo = options.view_as(WorkerOptions)  # type: WorkerOptions
wo.zone = "europe-west1-b"


# rest of your options:
google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = 'm-h'
google_cloud_options.job_name = 'myjob3'
google_cloud_options.staging_location = r'gs://p_df/staging'  # EUROPE-WEST1
google_cloud_options.region = r'europe-west1'
google_cloud_options.temp_location = r'gs://p_df/temp'  # EUROPE-WEST1
options.view_as(StandardOptions).runner = 'DataflowRunner'

p = beam.Pipeline(options=options)

Answer 1 (score: 2)

The BigQuerySource transform used by the Python DirectRunner does not automatically determine a location for the temporary tables. See BEAM-1909 for this issue.

It should work fine when you use the DataflowRunner.
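Until that issue is fixed, one possible DirectRunner workaround is to materialize the query into a table inside the EU dataset yourself, and then read that table with BigQuerySource(table=...) instead of a query, so no temporary dataset is needed at all. This is a hedged sketch assuming the google-cloud-bigquery client library is installed and credentials are configured; the destination table name t_27k_staged is hypothetical:

```python
def materialize_in_eu(project, sql, dataset_id, table_id):
    """Sketch: run `sql` as a query job pinned to the EU, writing the
    result into dataset_id.table_id, so a later BigQuerySource can
    read the table directly without a temporary dataset."""
    # Deferred import: needs the google-cloud-bigquery package and
    # valid credentials at call time.
    from google.cloud import bigquery

    client = bigquery.Client(project=project)
    job_config = bigquery.QueryJobConfig()
    job_config.destination = bigquery.DatasetReference(
        project, dataset_id).table(table_id)
    job_config.write_disposition = 'WRITE_TRUNCATE'
    # location='EU' keeps the query job in the same location as DS1.
    client.query(sql, job_config=job_config, location='EU').result()

# Hypothetical usage for the question's pipeline:
# materialize_in_eu('m-h', 'select dia, import from DS1.t_27k where true',
#                   'DS1', 't_27k_staged')
```

The pipeline would then use beam.io.BigQuerySource(table='DS1.t_27k_staged') instead of the query form.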