Here is the code that is supposed to read from a csv file and write both to another csv file and to BigQuery:
import argparse
import logging
import re
import apache_beam as beam
from apache_beam.io import ReadFromText
from apache_beam.io import WriteToText
from apache_beam.metrics import Metrics
from apache_beam.metrics.metric import MetricsFilter
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions
parser = argparse.ArgumentParser()
parser.add_argument('--input',
                    dest='input',
                    default='gs://dataflow-samples/shakespeare/kinglear.txt',
                    help='Input file to process.')
parser.add_argument('--output',
                    dest='output',
                    required=True,
                    help='Output file to write results to.')
known_args, pipeline_args = parser.parse_known_args(None)
pipeline_options = PipelineOptions(pipeline_args)
pipeline_options.view_as(SetupOptions).save_main_session = True
p = beam.Pipeline(options=pipeline_options)
# Read the text file[pattern] into a PCollection.
lines = p | 'read' >> ReadFromText(known_args.input)
lines | beam.Map(lambda x: x.split(','))
lines | 'write' >> WriteToText(known_args.output)
lines | 'write2' >> beam.io.Write(beam.io.BigQuerySink('xxxx:yyyy.aaaa'))
# Actually run the pipeline (all operations above are deferred).
result = p.run()
It is able to write to the output file, but it fails to do so for the BigQuery table (xxxx:yyyy.aaaa).
Here is the message that is displayed:
WARNING:root:A task failed with exception.
'unicode' object has no attribute 'iteritems'
Even though the schema is identical and the BigQuery table is empty, the table contained in the csv file is never written to BigQuery. I suspect this is because the data must be converted to JSON format. What corrections must be made to this code to make it work? Could you give me the lines of code I have to add to get it working?
Answer 0 (score: 0)
Look at the following lines:
1: lines = p | 'read' >> ReadFromText(known_args.input)
2: lines | beam.Map(lambda x: x.split(','))
3: lines | 'write' >> WriteToText(known_args.output)
4: lines | 'write2' >> beam.io.Write(beam.io.BigQuerySink('xxxx:yyyy.aaaa'))
`lines` is defined as a PCollection of the lines read from the text file. If you look at the BigQuery tornadoes example, you can see that (1) you need to convert each line into a dictionary with a field for each column, and (2) you need to provide a schema matching that dictionary to the BigQuerySink. For example:
def to_table_row(x):
    values = x.split(',')
    return {'field1': values[0], 'field2': values[1]}

lines = p | 'read' >> ReadFromText(known_args.input)

lines | 'write' >> WriteToText(known_args.output)

(lines
 | 'ToTableRows' >> beam.Map(to_table_row)
 | 'write2' >> beam.io.Write(beam.io.BigQuerySink(
     'xxxx:yyyy.aaaa',
     schema='field1:INTEGER, field2:INTEGER')))
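This also matches the error in the question: BigQuerySink calls `iteritems()` on each element, so it must receive dictionaries, whereas the raw `lines` PCollection contains plain strings. The conversion step can be checked on its own without running a pipeline; the sketch below assumes two integer columns so the values line up with the `field1:INTEGER, field2:INTEGER` schema (the `int()` casts are my own assumption, not part of the answer above):

```python
def to_table_row(line):
    """Turn one CSV line into the dict shape BigQuerySink expects."""
    values = line.split(',')
    # Cast to int so each value matches the INTEGER schema columns
    # (assumption: the CSV really does contain integers).
    return {'field1': int(values[0]), 'field2': int(values[1])}

# Quick local check of the transform logic:
rows = [to_table_row(line) for line in ['1,2', '3,4']]
print(rows)  # [{'field1': 1, 'field2': 2}, {'field1': 3, 'field2': 4}]
```

In the pipeline itself this function would simply be passed to `beam.Map(to_table_row)` before the BigQuery write, exactly as in the snippet above.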