Remove quotes from a CSV file and lzip it using Apache Beam's ParDo and DoFn, then transfer the lzip file to Amazon S3

Posted: 2018-10-23 08:00:55

Tags: python csv amazon-s3 google-cloud-dataflow apache-beam

Clean up the CSV file written to GCS by removing the quotes, then LZIP the file. Do we have to copy the file to a local disk to perform the cleanup and lzip the file, and how can this be achieved?

Move the cleaned LZIP file to S3. Can Dataflow communicate with S3 and write files? How can I do this?
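Newer Beam Python SDK releases (installed with the `aws` extra, `pip install apache-beam[aws]`) register an `s3://` filesystem, so sinks like `WriteToText` can target S3 paths directly. Failing that, the file can be uploaded with boto3 after the pipeline finishes. A hedged sketch, where the bucket, key, and `upload_to_s3` helper are placeholders of mine:

```python
def upload_to_s3(local_path, bucket, key, s3_client=None):
    """Upload a local file to S3.

    s3_client is injectable for testing; by default a boto3 client is
    built from the standard AWS credential chain.
    """
    if s3_client is None:
        import boto3  # deferred so the module imports without boto3 installed
        s3_client = boto3.client('s3')
    s3_client.upload_file(local_path, bucket, key)
```

AWS credentials would have to be available to the workers (environment variables or pipeline options), which the question's Dataflow setup does not show.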

Sample code below:

import logging

import apache_beam as beam


PROJECT = 'project_id'
BUCKET = 'project_bucket'


def run():
  argv = [
      '--project={0}'.format(PROJECT),
      '--job_name=readwritebq',
      '--save_main_session',
      '--staging_location=gs://{0}/staging/'.format(BUCKET),
      '--temp_location=gs://{0}/staging/'.format(BUCKET),
      '--runner=DataflowRunner'
  ]

  with beam.Pipeline(argv=argv) as p:

    # Execute the SQL in BigQuery and store the result data set.
    BQ_DATA = p | 'read_bq_view' >> beam.io.Read(
        beam.io.BigQuerySource(query='Select * from `dataset.table`',
                               use_standard_sql=True))

    # Write the result set to the destination BigQuery table.
    BQ_DATA | 'Write_bq_table' >> beam.io.WriteToBigQuery(
        table='tablename',
        dataset='datasetname',
        project='project_id',
        schema='name:string,gender:string,count:integer',
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE)

    # Write the data from BQ_DATA to GCS in CSV format.
    BQ_VALUES = BQ_DATA | 'read values' >> beam.Map(lambda x: x.values())
    BQ_CSV = BQ_VALUES | 'CSV format' >> beam.Map(
        lambda row: ', '.join(['"' + str(column) + '"' for column in row]))
    BQ_CSV | 'Write_to_GCS' >> beam.io.WriteToText(
        'gs://{0}/results/output'.format(BUCKET), file_name_suffix='.csv',
        header='word, word count, corpus')

    # Clean up the CSV file written to GCS, removing the quotes, and LZIP
    # the file.
    # **Do we have to copy the file locally to perform the cleanup and
    # lzip the file, and how can this be achieved?**

    # Move the cleaned LZIP file to S3.
    # **Can Dataflow communicate with S3 and write files? How can I
    # achieve this?**


if __name__ == '__main__':
  logging.getLogger().setLevel(logging.INFO)
  run()
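Putting the two steps together, the cleanup and compression can be streamed between filesystems without a local copy, using Beam's `FileSystems` abstraction. A sketch under two stated assumptions: `clean_and_compress` and `transfer` are my own helper names, and gzip stands in for lzip because Python's standard library has no lzip writer (a true lzip file would need the external `lzip` binary or a third-party binding):

```python
import gzip


def clean_and_compress(data: bytes) -> bytes:
    """Strip double quotes, then gzip-compress (gzip as a stand-in for lzip)."""
    return gzip.compress(data.replace(b'"', b''))


def transfer(src_path: str, dst_path: str) -> None:
    """Stream a file from one filesystem to another without a local copy."""
    # FileSystems resolves the scheme (gs://, s3://, local) automatically,
    # so the same code reads from GCS and writes to S3.
    from apache_beam.io.filesystems import FileSystems
    with FileSystems.open(src_path) as src:
        data = clean_and_compress(src.read())
    with FileSystems.create(dst_path) as dst:
        dst.write(data)
```

`transfer('gs://bucket/results/output.csv', 's3://bucket/output.csv.gz')` would be the intended usage, assuming the `aws` extra is installed and S3 credentials are configured.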

0 Answers:

There are no answers.