How to handle non-ASCII characters in a BigQuery streaming pipeline?

Date: 2019-04-27 03:48:07

Tags: python google-bigquery google-cloud-dataflow apache-beam

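How can I support non-ASCII characters in a Dataflow pipeline that streams into Google Cloud BigQuery using Python?

The code below tries to insert a row into BigQuery, but it fails with this error: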

  

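ValidationError: Field string_value encountered non-ASCII string 'This code fail because of \xc3\xa3, a non-ascii character!': 'ascii' codec can't decode byte 0xc3 in position 23: ordinal not in range(128)

If we remove the 'ã' character from the text, the code works correctly.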

# -*- coding: utf-8 -*-

from __future__ import absolute_import
import argparse
import datetime

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import GoogleCloudOptions

options = PipelineOptions()
google_cloud_options = options.view_as(GoogleCloudOptions)


def run(argv=None):
    parser = argparse.ArgumentParser()
    known_args, pipeline_args = parser.parse_known_args(argv)
    p = beam.Pipeline(options=PipelineOptions(pipeline_args))

    (p
     | 'Create the PCollection' >> beam.Create([{
                'timestamp': datetime.datetime.now().isoformat(),
                'text': 'This code fail because of ã, a non-ascii character!',
            }])
     | 'Write to BigQuery' >> beam.io.Write(
                beam.io.BigQuerySink(
                    'lake.test',
                    schema='timestamp:TIMESTAMP,text:STRING',
                    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                )
            )
     )

    p.run().wait_until_finish()


if __name__ == '__main__':
    run()
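A possible direction, offered as a sketch rather than a confirmed fix: the `from __future__` import suggests this runs on Python 2, where the literal containing 'ã' in a UTF-8 source file is a byte string, and the error shows something downstream trying to decode those bytes as ASCII. Assuming that is the cause, decoding byte strings to text before the write step may avoid it. The helpers `to_text` and `decode_row` below are hypothetical names introduced for illustration; they are not part of Apache Beam.

```python
def to_text(value, encoding='utf-8'):
    """Decode a byte string to text; return any other value unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def decode_row(row, encoding='utf-8'):
    """Hypothetical helper: decode every byte-string value in a row dict."""
    return {key: to_text(val, encoding) for key, val in row.items()}


# The same text as in the question, written as explicit UTF-8 bytes
# (this is what a plain '...' literal holds on Python 2).
row = {
    'timestamp': '2019-04-27T03:48:07',
    'text': b'This code fail because of \xc3\xa3, a non-ascii character!',
}
decoded = decode_row(row)
```

In the pipeline above, such a function could be applied with a `beam.Map(decode_row)` step placed before 'Write to BigQuery'. Writing the literal as `u'...'` in the `beam.Create` call is a simpler alternative under the same assumption.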

0 answers:

No answers yet.