Reading a CSV file and populating a BigQuery table

Date: 2017-07-31 09:03:57

Tags: google-cloud-dataflow

Here is code that is supposed to read from a CSV file and write both to another CSV file and to BigQuery:

import argparse
import logging
import re
import apache_beam as beam
from apache_beam.io import ReadFromText
from apache_beam.io import WriteToText
from apache_beam.metrics import Metrics
from apache_beam.metrics.metric import MetricsFilter
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions
parser = argparse.ArgumentParser()
parser.add_argument('--input',
                  dest='input',
                  default='gs://dataflow-samples/shakespeare/kinglear.txt',
                  help='Input file to process.')
parser.add_argument('--output',
                  dest='output',
                  required=True,
                  help='Output file to write results to.')
known_args, pipeline_args = parser.parse_known_args(None)
pipeline_options = PipelineOptions(pipeline_args)
pipeline_options.view_as(SetupOptions).save_main_session = True
p = beam.Pipeline(options=pipeline_options)
# Read the text file[pattern] into a PCollection.
lines = p | 'read' >> ReadFromText(known_args.input)
lines | beam.Map(lambda x: x.split(','))
lines | 'write' >> WriteToText(known_args.output)
lines | 'write2' >> beam.io.Write(beam.io.BigQuerySink('xxxx:yyyy.aaaa'))
# Actually run the pipeline (all operations above are deferred).
result = p.run()

It is able to write to the output file, but it fails to do so for the BigQuery table (xxxx:yyyy.aaaa).

Here is the message that is displayed:

WARNING:root:A task failed with exception.
'unicode' object has no attribute 'iteritems'

Even though the schema is identical and the BigQuery table is empty, the data contained in the CSV file is never written to BigQuery. I suspect this is because the data must first be converted to a JSON-like format. What corrections must be made to this code for it to work? Could you provide the lines of code I need to add?

1 answer:

Answer 0 (score: 0)

Look at the following lines:

1: lines = p | 'read' >> ReadFromText(known_args.input)
2: lines | beam.Map(lambda x: x.split(','))
3: lines | 'write' >> WriteToText(known_args.output)
4: lines | 'write2' >> beam.io.Write(beam.io.BigQuerySink('xxxx:yyyy.aaaa'))
  1. lines is defined as the PCollection of lines read from the text file.
  2. This creates a new PCollection by splitting each line, but the result is never assigned to anything, so it effectively does nothing.
  3. This writes the original lines to a text file (so you will not see one word per line; each output record is an original, unsplit line).
  4. This writes the lines read from the input straight to BigQuery.
  5. If you look at the BigQuery tornadoes example, you can see that (1) you need to convert each line into a dictionary with a field for each column, and (2) you need to provide a schema matching that dictionary to the BigQuerySink. For example:

    def to_table_row(x):
      # Build the dictionary the BigQuery sink expects, one key per column.
      # The schema below declares INTEGER fields, so convert the values.
      values = x.split(',')
      return {'field1': int(values[0]), 'field2': int(values[1])}

    lines = p | 'read' >> ReadFromText(known_args.input)
    lines | 'write' >> WriteToText(known_args.output)
    (lines
     | 'ToTableRows' >> beam.Map(to_table_row)
     | 'write2' >> beam.io.Write(beam.io.BigQuerySink(
         'xxxx:yyyy.aaaa',
         schema='field1:INTEGER,field2:INTEGER')))
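
For reference, here is a minimal end-to-end sketch of the corrected pipeline. It assumes a two-column CSV of integers; the table name xxxx:yyyy.aaaa and the field1/field2 names are placeholders carried over from the question, and the gs:// paths are hypothetical:

    import apache_beam as beam
    from apache_beam.io import ReadFromText
    from apache_beam.io import WriteToText
    from apache_beam.options.pipeline_options import PipelineOptions

    def to_table_row(line):
      # One dictionary per row, keyed by column name, as the sink requires.
      values = line.split(',')
      return {'field1': int(values[0]), 'field2': int(values[1])}

    p = beam.Pipeline(options=PipelineOptions())
    lines = p | 'read' >> ReadFromText('gs://my-bucket/input.csv')  # hypothetical path

    # Branch 1: echo the raw lines to a text file, as before.
    lines | 'write' >> WriteToText('gs://my-bucket/output')  # hypothetical path

    # Branch 2: convert each line to a dict, then write it to BigQuery.
    (lines
     | 'ToTableRows' >> beam.Map(to_table_row)
     | 'write2' >> beam.io.Write(beam.io.BigQuerySink(
         'xxxx:yyyy.aaaa',
         schema='field1:INTEGER,field2:INTEGER')))

    p.run().wait_until_finish()

This also accounts for the error in the question: the BigQuery sink iterates over each element as a dictionary of column values, so when it is handed a raw line (a unicode string) instead, it fails with 'unicode' object has no attribute 'iteritems'.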