Writing BigQuery results to GCS in CSV format using Apache Beam

Asked: 2018-10-22 12:27:41

Tags: python google-bigquery google-cloud-dataflow apache-beam

I am new to Apache Beam, and I am trying to write a pipeline in Python that extracts data from Google BigQuery and writes it to GCS in CSV format.

Using beam.io.Read(beam.io.BigQuerySource()) I can read the data from BigQuery, but I am not sure how to write it to GCS in CSV format.

Is there a built-in transform that achieves this? Could you please help me?

import logging

import apache_beam as beam


PROJECT = 'project_id'
BUCKET = 'project_bucket'


def run():
    argv = [
        '--project={0}'.format(PROJECT),
        '--job_name=readwritebq',
        '--save_main_session',
        '--staging_location=gs://{0}/staging/'.format(BUCKET),
        '--temp_location=gs://{0}/staging/'.format(BUCKET),
        '--runner=DataflowRunner'
    ]

    with beam.Pipeline(argv=argv) as p:

        # Run the SQL in BigQuery and read the result set.
        BQ_SQL_TO_TABLE = p | 'read_bq_view' >> beam.io.Read(
            beam.io.BigQuerySource(query='Select * from `dataset.table`', use_standard_sql=True))

        # Extract data from BigQuery to GCS in CSV format.
        # This is where I need your help

        BQ_SQL_TO_TABLE | 'Write_bq_table' >> beam.io.WriteToBigQuery(
            table='tablename',
            dataset='datasetname',
            project='project_id',
            schema='name:string,gender:string,count:integer',
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE)


if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    run()

2 Answers:

Answer 0 (score: 4)

You can use WriteToText to add a `.csv` suffix and a header. Keep in mind that you will need to parse the query results into CSV format yourself. As an example, I used the Shakespeare public dataset and the following query:

SELECT word, word_count, corpus FROM `bigquery-public-data.samples.shakespeare` WHERE CHAR_LENGTH(word) > 3 ORDER BY word_count DESC LIMIT 10

We now read the query results with the following:

BQ_DATA = p | 'read_bq_view' >> beam.io.Read(
    beam.io.BigQuerySource(query=query, use_standard_sql=True))

BQ_DATA now contains key-value pairs:

{u'corpus': u'hamlet', u'word': u'HAMLET', u'word_count': 407}
{u'corpus': u'kingrichardiii', u'word': u'that', u'word_count': 319}
{u'corpus': u'othello', u'word': u'OTHELLO', u'word_count': 313}

We can apply a beam.Map function to keep only the values:

BQ_VALUES = BQ_DATA | 'read values' >> beam.Map(lambda x: x.values())

An excerpt of BQ_VALUES:

[u'hamlet', u'HAMLET', 407]
[u'kingrichardiii', u'that', 319]
[u'othello', u'OTHELLO', 313]

Finally, we map again to join the column values with commas instead of emitting a list (keep in mind that you would need to escape double quotes if they can appear inside a field):

BQ_CSV = BQ_VALUES | 'CSV format' >> beam.Map(
    lambda row: ', '.join(['"'+ str(column) +'"' for column in row]))
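The manual join above does not escape double quotes that may occur inside a field. A sketch of a safer per-row formatter using Python's standard csv module (the helper name `row_to_csv` is my own, not from the answer):

```python
import csv
import io


def row_to_csv(values):
    # Serialize one row with csv.writer so any embedded double quotes
    # are escaped by doubling them, per the CSV convention.
    buf = io.StringIO()
    csv.writer(buf, quoting=csv.QUOTE_ALL).writerow(values)
    return buf.getvalue().rstrip('\r\n')


print(row_to_csv([u'hamlet', u'say "aye"', 407]))
# "hamlet","say ""aye""","407"
```

This could then replace the lambda in the pipeline, e.g. `BQ_VALUES | 'CSV format' >> beam.Map(row_to_csv)`.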

Now we write the results to GCS with the suffix and header:

BQ_CSV | 'Write_to_GCS' >> beam.io.WriteToText(
    'gs://{0}/results/output'.format(BUCKET), file_name_suffix='.csv', header='word, word count, corpus')

Written results:

$ gsutil cat gs://$BUCKET/results/output-00000-of-00001.csv
word, word count, corpus
"hamlet", "HAMLET", "407"
"kingrichardiii", "that", "319"
"othello", "OTHELLO", "313"
"merrywivesofwindsor", "MISTRESS", "310"
"othello", "IAGO", "299"
"antonyandcleopatra", "ANTONY", "284"
"asyoulikeit", "that", "281"
"antonyandcleopatra", "CLEOPATRA", "274"
"measureforemeasure", "your", "274"
"romeoandjuliet", "that", "270"

Answer 1 (score: 1)

For anyone looking for an update using Python 3, replace

BQ_VALUES = BQ_DATA | 'read values' >> beam.Map(lambda x: x.values())

with

BQ_VALUES = BQ_DATA | 'read values' >> beam.Map(lambda x: list(x.values()))
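To illustrate why the list() wrapper matters: in Python 3, dict.values() returns a lazy view object rather than a list, so wrapping it in list() yields a plain list of values for the downstream CSV-formatting step. A minimal stdlib-only sketch:

```python
row = {u'corpus': u'hamlet', u'word': u'HAMLET', u'word_count': 407}

# In Python 3, dict.values() returns a dict_values view, not a list.
view = row.values()
print(type(view).__name__)
# dict_values

# Wrapping in list() gives a plain list of the values.
values = list(row.values())
print(values)
# ['hamlet', 'HAMLET', 407]
```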