Writing JSON data to a BigQuery table as just a single STRING field

Asked: 2018-02-20 06:49:00

Tags: python google-bigquery google-cloud-dataflow apache-beam

My input data looks like this:

[someGarbagevalue]{"Id": 1, "Address": {"Street":"MG Road","City":"Pune"}}
[someGarbagevalue]{"Id": 2, "Address": {"City":"Mumbai"}}
[someGarbagevalue]{"Id": 3, "Address": {"Street":"XYZ Road"}}
[someGarbagevalue]{"Id": 4}
[someGarbagevalue]{"Id": 5, "PhoneNumber": 12345678, "Address": {"Street":"ABCD Road", "City":"Bangalore"}}

After reading the data I strip off the [someGarbagevalue] prefix and then try to write to BigQuery:

class processFunction(beam.DoFn):
  def process(self, element):
    global line
    line = element[element.find(']') + 1:].strip()
    return [line]

def run(argv=None):
    pipeline_options = PipelineOptions()
    p = beam.Pipeline(options=pipeline_options)
    first = p | 'read' >> ReadFromText(wordcount_options.input)
    second = (first
              | 'process' >> beam.ParDo(processFunction())
              | 'write' >> beam.io.WriteToBigQuery(
                  'myBucket:tableFolder.test_table'))

Questions

  1. How do I write each line to BigQuery as a STRING?
  2. If I write the data to BigQuery one line per row, how will I query the BigQuery table?
  3. Current error:

    Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1; errors: 1. Error while reading data, error message: JSON parsing error in row starting at position 0: Value encountered without start of object.
    

1 answer:

Answer 0 (score: 0)

Your code has a few things missing/wrong:

  1. Why are you using global line in processFunction? It is not needed there.
  2. You should specify the BigQuery table schema in WriteToBigQuery.
  3. processFunction should emit, for each element, a dictionary whose keys match the schema fields; the value of that field should be your string.
  4. Your code should look more or less like this:

    import apache_beam as beam
    from apache_beam.io import ReadFromText
    from apache_beam.options.pipeline_options import PipelineOptions

    class processFunction(beam.DoFn):
      def process(self, element):
        # Keep only the JSON part after the "[someGarbagevalue]" prefix.
        line = element[element.find(']') + 1:].strip()
        # Emit one dict per element; its key must match the schema field name.
        yield {"line": line}

    def run(argv=None):
        pipeline_options = PipelineOptions()
        p = beam.Pipeline(options=pipeline_options)
        # wordcount_options comes from the surrounding script's argument parsing.
        first = p | 'read' >> ReadFromText(wordcount_options.input)
        second = (first
                  | 'process' >> beam.ParDo(processFunction())
                  | 'write' >> beam.io.WriteToBigQuery(
                      'myBucket:tableFolder.test_table', schema="line:STRING"))
        p.run()