How to ensure that subsequent Dataflow jobs will execute on the same machine

Time: 2019-04-25 06:23:52

Tags: google-cloud-platform google-cloud-dataflow dataflow

I am able to load multiple CSV files into BigQuery via Dataflow using a for loop. But in this setup a new Dataflow job is launched on every iteration, which adds extra overhead.

The Dataflow part of my code:

import argparse

import apache_beam as beam
from apache_beam.io import ReadFromText
from apache_beam.options.pipeline_options import PipelineOptions


def run(abs_csv_file_name="", table_name="", argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_csv_file',
                        dest='input_csv_file',
                        default='gs://{0}/{1}'.format(bucket_name, abs_csv_file_name),
                        help='Input file to process.')
    parser.add_argument('--output_stage_bq',
                        dest='output_stage_bq',
                        default='{0}:{1}.{2}'.format(project_id, stage_dataset_name, table_name),
                        help='Staging BigQuery table to write results to.')
    parser.add_argument('--output_target_bq',
                        dest='output_target_bq',
                        default='{0}:{1}.{2}'.format(project_id, dataset_name, table_name),
                        help='Target BigQuery table to write results to.')

    known_args, pipeline_args = parser.parse_known_args(argv)

    pipeline_options = PipelineOptions(pipeline_args)

    # delete_a_bq_table(table_name)
    table_spec = "{0}:{1}.{2}".format(project_id, stage_dataset_name, table_name)

    with beam.Pipeline(options=pipeline_options) as p1:
        # Read the CSV from GCS, parse each line into a dict, and load it
        # into the staging table.
        data_csv = p1 | 'Read CSV file' >> ReadFromText(known_args.input_csv_file)
        dict1 = data_csv | 'Format to json' >> beam.ParDo(Split())
        dict1 | 'Write to BigQuery' >> beam.io.WriteToBigQuery(
            known_args.output_stage_bq,
            schema=product_revenue_schema)

        # Second branch: copy the staging table into the target dataset.
        fullTable = p1 | 'ReadFromBQ' >> beam.io.Read(beam.io.BigQuerySource(table_spec))
        fullTable | 'writeToBQ another dataset' >> beam.io.WriteToBigQuery(
            known_args.output_target_bq,
            schema=product_revenue_schema)

I believe there should be a better way than calling the run function once for every table:

for i in range(len(table_names)):
    find_product_revenue_schema_and_column_name(table_name=table_names[i])
    run(abs_csv_file_name=abs_file_names[i], table_name=table_names[i])

I need to write the code so that subsequent Dataflow jobs will execute on the same machine, in order to save the machine setup time.
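
For example, is something along these lines possible? This is only a rough sketch of what I have in mind: build one branch per table inside a single pipeline, so that only one Dataflow job is submitted and the workers are started only once. It assumes find_product_revenue_schema_and_column_name could be changed to return the schema it finds; Split, bucket_name, project_id and stage_dataset_name are the same helpers/globals as in the code above.

import apache_beam as beam
from apache_beam.io import ReadFromText
from apache_beam.options.pipeline_options import PipelineOptions


def run_all(table_names, abs_file_names, argv=None):
    # One pipeline -> one Dataflow job -> one worker startup.
    pipeline_options = PipelineOptions(argv)

    with beam.Pipeline(options=pipeline_options) as p:
        for table_name, file_name in zip(table_names, abs_file_names):
            # Assumed here to return the schema instead of setting a global.
            schema = find_product_revenue_schema_and_column_name(table_name=table_name)

            # Transform labels must be unique within a pipeline, hence the
            # table_name suffix on every step.
            (p
             | 'Read CSV {0}'.format(table_name) >> ReadFromText(
                 'gs://{0}/{1}'.format(bucket_name, file_name))
             | 'Format to json {0}'.format(table_name) >> beam.ParDo(Split())
             | 'Write to BigQuery {0}'.format(table_name) >> beam.io.WriteToBigQuery(
                 '{0}:{1}.{2}'.format(project_id, stage_dataset_name, table_name),
                 schema=schema))

I left the stage-to-target copy out of this sketch, since Beam gives no ordering guarantee between the WriteToBigQuery step and a BigQuerySource read of the same table within one pipeline, so that copy would presumably still have to run after this job finishes.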

0 Answers:

No answers