ValueError: A BigQuery table or a query must be specified

Asked: 2019-08-16 22:04:31

Tags: python google-bigquery google-cloud-functions google-cloud-dataflow apache-beam

My Cloud Function builds a dynamic query from certain rules, stores it in a file in Cloud Storage, and then launches my Dataflow template. I pass the file that holds the query to the Dataflow template as a ValueProvider (inputFile), and in my pipeline I try to run that query with beam.io.BigQuerySource. But this gives me the error: ValueError: A BigQuery table or a query must be specified.

Some of the Cloud Function code:

from google.cloud import bigquery
import datetime

# query, job_config, dataset_id, table_name, unit and PROJECT come from the
# surrounding function.
client = bigquery.Client()

# Destination table (assumed to be set as the query destination via job_config),
# defined before it is referenced below.
dataset_ref = client.dataset(dataset_id, project=PROJECT)
table_ref = dataset_ref.table(table_name)

query_job = client.query(
    query,
    job_config=job_config)
query_job.result()
print('Query results loaded to table {}'.format(table_ref.path))

file_name = '{}_RM_{}.csv'.format(unit, datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S:%f')[:-3])
destination_uri = "gs://test-bucket/{}".format(file_name)

extract_job = client.extract_table(
    table_ref,
    destination_uri)
extract_job.result()            # Extract the query results to GCS
client.delete_table(table_ref)  # Delete the temporary table in BQ


# Parameters and environment for launching the Dataflow template.
BODY = {
    "jobName": "{jobname}".format(jobname=JOBNAME),
    "parameters": {
        "inputFile": destination_uri
    },
    "environment": {
        "tempLocation": "gs://{bucket}/temp".format(bucket=BUCKET),
        "zone": "europe-west1-b"
    }
}

request = service.projects().templates().launch(projectId=PROJECT, gcsPath=GCSPATH, body=BODY)
response = request.execute()

Dataflow code:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class UserOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument('--inputFile', default='query.txt')


class Query:
    def query_final(self, inputFile):
        from google.cloud import storage
        client = storage.Client()
        bucket = client.get_bucket('ingka-retention-test-bucket')
        blob = bucket.get_blob(str(inputFile))
        return blob


def dataflow():
    options = PipelineOptions.from_dictionary(pipeline_options)
    user_options = options.view_as(UserOptions)

    inputFile = user_options.inputFile
    new_query = Query()
    final_query = new_query.query_final(inputFile)

    with beam.Pipeline(options=options) as p:
        rows = p | 'Read Orders from BigQuery' >> beam.io.Read(
            beam.io.BigQuerySource(query=final_query, use_standard_sql=True))

What is causing this, or what would be a better approach to this task? Thanks in advance!

1 Answer:

Answer 0 (score: 0):

This is not possible with the BigQuery source, because the query is needed at graph compilation time. The source and its settings are locked in when the pipeline graph is constructed.
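To make the failure mode concrete: a template parameter declared with add_value_provider_argument is only a placeholder object while the graph is built, so the blob lookup in the question cannot work at that point. A minimal illustration (the printed reprs are approximate, not taken from the original post):

inputFile = user_options.inputFile
print(type(inputFile))  # <class 'apache_beam.options.value_provider.RuntimeValueProvider'>
print(str(inputFile))   # something like "RuntimeValueProvider(option: inputFile)", not the file name

# Hence bucket.get_blob(str(inputFile)) finds no such object and returns None,
# and beam.io.BigQuerySource(query=None) raises:
#   ValueError: A BigQuery table or a query must be specified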

Workaround: you can call the BigQuery API inline inside a ParDo and parameterize the code just as you did above; that gets evaluated at runtime. To kick off the ParDo, construct a PCollection with a few items corresponding to the N calls you want to make. Keep in mind that if a bundle of your ParDo fails, you will have to handle the idempotency concerns yourself. A sketch of this pattern follows below.
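A minimal sketch of that workaround, assuming the --inputFile template parameter carries a full gs://bucket/object path to the query file. The class name ReadQueryResults and the single-element trigger are illustrative, not from the original answer; UserOptions is reused from the question:

import apache_beam as beam


class ReadQueryResults(beam.DoFn):
    """Downloads the query file and runs it against BigQuery at runtime."""

    def __init__(self, input_file):
        self.input_file = input_file  # a ValueProvider, resolved only on the workers

    def process(self, element):
        from google.cloud import bigquery, storage

        # ValueProvider.get() is only legal at runtime, inside the worker.
        path = self.input_file.get()  # e.g. a gs://test-bucket/... URI
        bucket_name, blob_name = path[len('gs://'):].split('/', 1)

        blob = storage.Client().get_bucket(bucket_name).get_blob(blob_name)
        query = blob.download_as_string().decode('utf-8')

        # Inline BigQuery call, executed at runtime rather than at graph
        # construction time.
        for row in bigquery.Client().query(query).result():
            yield dict(row.items())


def run(options):
    user_options = options.view_as(UserOptions)
    with beam.Pipeline(options=options) as p:
        rows = (p
                | 'Trigger' >> beam.Create([None])  # one dummy element to invoke the DoFn once
                | 'Run query' >> beam.ParDo(ReadQueryResults(user_options.inputFile)))

Note the idempotency caveat above applies here: if a bundle of this DoFn is retried, the query is re-run and its rows re-emitted.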

Related: How to run dynamic second query in google cloud dataflow?