My input data looks like this:
[someGarbagevalue]{"Id": 1, "Address": {"Street":"MG Road","City":"Pune"}}
[someGarbagevalue]{"Id": 2, "Address": {"City":"Mumbai"}}
[someGarbagevalue]{"Id": 3, "Address": {"Street":"XYZ Road"}}
[someGarbagevalue]{"Id": 4}
[someGarbagevalue]{"Id": 5, "PhoneNumber": 12345678, "Address": {"Street":"ABCD Road", "City":"Bangalore"}}
After reading the data, I strip the [someGarbagevalue] prefix and then try to write the result to BigQuery:
class processFunction(beam.DoFn):
    def process(self, element):
        global line
        line = element[element.find(']') + 1:].strip()
        return [line]

def run(argv=None):
    pipeline_options = PipelineOptions()
    p = beam.Pipeline(options=pipeline_options)
    first = p | 'read' >> ReadFromText(wordcount_options.input)
    second = (first
              | 'process' >> (beam.ParDo(processFunction()))
              | 'write' >> beam.io.WriteToBigQuery(
                  'myBucket:tableFolder.test_table'))
Question: how do I write each line to BigQuery as a single column named line of type STRING? Current error:
Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1; errors: 1. Error while reading data, error message: JSON parsing error in row starting at position 0: Value encountered without start of object.
Answer 0 (score: 0)
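Your code has a few missing pieces and mistakes:
Why is global line used in processFunction? It is not needed there.
You should specify a schema in WriteToBigQuery.
processFunction should return a dictionary keyed by the field named in the schema. The value of that field should be your string.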
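To illustrate the last point, here is a minimal sketch in plain Python (no Beam needed). It assumes the BigQuery sink serializes each element as one JSON value per row, which is consistent with the "Value encountered without start of object" error but is not confirmed by the question. The variable names are illustrative only.

```python
import json

raw = '[someGarbagevalue]{"Id": 4}'
line = raw[raw.find(']') + 1:].strip()

# Assumption: a bare str element ends up serialized as a JSON string,
# so the row starts with a double quote instead of an object.
as_bare_string = json.dumps(line)

# A dict element serializes as a JSON object, one column per key.
as_row = json.dumps({"line": line})

print(as_bare_string[0])  # first character of the bare-string row
print(as_row[0])          # first character of the dict row
```

The dict form is what a table with schema line:STRING can accept: the whole cleaned line becomes the value of the single line column.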
Your code should look more or less like this:
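import apache_beam as beam
from apache_beam.io import ReadFromText
from apache_beam.options.pipeline_options import PipelineOptions

class processFunction(beam.DoFn):
    def process(self, element):
        # Drop everything up to and including the first ']'.
        line = element[element.find(']') + 1:].strip()
        # Yield one row as a dict keyed by the schema's column name.
        # Returning a bare dict would make Beam iterate over its keys.
        yield {"line": line}

def run(argv=None):
    pipeline_options = PipelineOptions()
    p = beam.Pipeline(options=pipeline_options)
    # wordcount_options comes from the question's setup (not shown here).
    first = p | 'read' >> ReadFromText(wordcount_options.input)
    second = (first
              | 'process' >> beam.ParDo(processFunction())
              | 'write' >> beam.io.WriteToBigQuery(
                  'myBucket:tableFolder.test_table', schema="line:STRING"))
    # Run the pipeline and block until it completes.
    p.run().wait_until_finish()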
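The row-building step can also be checked on its own before running the pipeline. The sketch below pulls the same slicing logic into a plain function; to_row is a hypothetical helper name, not part of Beam or of the original code, and the sample lines are taken from the question.

```python
def to_row(element):
    """Hypothetical helper: strip the leading '[...]' prefix and wrap the
    remainder in a one-column row matching the schema line:STRING."""
    line = element[element.find(']') + 1:].strip()
    return {"line": line}

samples = [
    '[someGarbagevalue]{"Id": 1, "Address": {"Street":"MG Road","City":"Pune"}}',
    '[someGarbagevalue]{"Id": 4}',
]

rows = [to_row(s) for s in samples]
for row in rows:
    print(row)
```

Note that str.find returns -1 when there is no ']' in the input, so the slice then starts at index 0 and the line is kept whole. If the garbage prefix could itself contain a ']', this simple slicing would cut at the wrong place and a stricter parser would be needed.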