将数据从csv写入BigQuery

时间:2017-08-30 05:41:32

标签: python google-cloud-dataflow

我编写了一个Python数据流作业来从csv文件中读取数据,并使用该数据填充BigQuery表。但是,每当我运行此作业时,错误都会一直弹出。如果我删除写入Big Query部分并写入文件,则代码执行正常,表格将以dict格式写入输出文件。代码如下:

import argparse
import logging
import re
import apache_beam as beam
from apache_beam.io import ReadFromText
from apache_beam.io import WriteToText
from apache_beam.metrics import Metrics
from apache_beam.metrics.metric import MetricsFilter
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions
import json
from apache_beam.io.gcp.bigquery import TableRowJsonCoder
class ToTableRowDoFn(beam.DoFn):
  def process(self,x):
    values = x.split(',')
    rows={}
    rows["Name"]=values[0]
    rows["Place_of_Birth"]=values[1]
    rows["Age"]=values[2]
    return [rows]
parser = argparse.ArgumentParser()
parser.add_argument('--input',
                  dest='input',
                  default='gs://dataflow-samples/shakespeare/kinglear.txt',
                  help='Input file to process.')
parser.add_argument('--output',
                  dest='output',   
                  help='Output file to write results to.')
known_args, pipeline_args = parser.parse_known_args(None)
pipeline_options = PipelineOptions(pipeline_args)
pipeline_options.view_as(SetupOptions).save_main_session = True
p = beam.Pipeline(options=pipeline_options)
# Read the text file[pattern] into a PCollection.

lines = p | 'read' >> ReadFromText(known_args.input)
lines | 'ToTableRows' >> beam.ParDo(ToTableRowDoFn()) | 'write' >> 
beam.io.Write(beam.io.BigQuerySink(
  'xxxx:ZZZZZZ.YYYYY',
  schema='Name:STRING, Place_of_Birth:STRING, Age:STRING'))

# Actually run the pipeline (all operations above are deferred).
result = p.run()

我正在加载以下csv文件:

Name1,Place1,40
Name2,Place2,20

我在csv文件上运行此代码时遇到的错误如下:

AttributeError: 'FieldList' object has no attribute '_FieldList__field'

如果我删除了WritetoBigQuery部分并写入文件,则代码工作正常。请帮我解决这个问题。

1 个答案:

答案 0 :(得分:0)

我有同样的问题。发布此线程上发生的其他事件。它与酸洗有关。 您必须禁用save_main_session选项。我只是在管道选项中对此进行了评论以进行测试。 https://issues.apache.org/jira/browse/BEAM-3134

https://cloud.google.com/dataflow/faq#how-do-i-handle-nameerrors