How can I support non-ASCII characters in a Dataflow pipeline that streams into Google Cloud BigQuery using Python?
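The code below tries to insert a row into BigQuery, but it fails with this error:
ValidationError: Field string_value encountered non-ASCII string 'This code fail because of \xc3\xa3, a non-ascii character!': 'ascii' codec can't decode byte 0xc3 in position 23: ordinal not in range(128)
If we remove the 'ã' character from the text, the code works fine.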
# -*- coding: utf-8 -*-
from __future__ import absolute_import
import argparse
import datetime
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import GoogleCloudOptions
options = PipelineOptions()
google_cloud_options = options.view_as(GoogleCloudOptions)
def run(argv=None):
    parser = argparse.ArgumentParser()
    known_args, pipeline_args = parser.parse_known_args(argv)
    p = beam.Pipeline(options=PipelineOptions(pipeline_args))
    (p
     | 'Create the PCollection' >> beam.Create([{
         'timestamp': datetime.datetime.now().isoformat(),
         'text': 'This code fail because of ã, a non-ascii character!',
     }])
     | 'Write to BigQuery' >> beam.io.Write(
         beam.io.BigQuerySink(
             'lake.test',
             schema='timestamp:TIMESTAMP,text:STRING',
             create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
             write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
         )
     )
    )
    p.run().wait_until_finish()


if __name__ == '__main__':
    run()
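
For reference, the same kind of decode failure can be reproduced outside of Beam. This is only a minimal sketch: it assumes the error comes from the UTF-8 bytes of the text being decoded with the 'ascii' codec somewhere in the write path, as the message suggests. The helper `decode_as` is a hypothetical name used for illustration and is not part of Beam or BigQuery; whether passing unicode text to the sink avoids the error is not confirmed here.

```python
# -*- coding: utf-8 -*-
# Sketch only: reproduces the 'ascii' decode failure in isolation.
# UTF-8 bytes of the sample text; 'ã' becomes the two bytes 0xc3 0xa3.
raw = u'This code fail because of ã, a non-ascii character!'.encode('utf-8')


def decode_as(data, codec):
    """Hypothetical helper: return (text, None) on success, (None, error) on failure."""
    try:
        return data.decode(codec), None
    except UnicodeDecodeError as err:
        return None, err


# Decoding the bytes as ASCII fails at the first byte of 'ã'.
ascii_text, ascii_error = decode_as(raw, 'ascii')

# Decoding the same bytes as UTF-8 succeeds and recovers the original text.
utf8_text, utf8_error = decode_as(raw, 'utf-8')

print(ascii_error)
print(utf8_text)
```

The ASCII attempt raises the same "ordinal not in range(128)" complaint as the traceback, while the UTF-8 attempt returns the text intact, which suggests the string reaches the sink as bytes rather than as unicode text.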