Question

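I have a Dataflow pipeline that parses data from Pub/Sub and writes it to BigQuery. The data is in proto3 format.

The data I receive from Pub/Sub is encoded with protobuf's 'SerializeToString()' method.
I deserialize it and insert the parsed data into BigQuery, and that works fine. However, I have been asked to also store the raw binary protobuf data in case an error occurs at insertion time.
To do that, I created a simple BigQuery table with a single field, 'data', of type BYTES.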

So I added a step to the pipeline that just takes the data from the PubSub message and returns it:

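import logging
from typing import Dict

import apache_beam as beam

class GetBytes(beam.DoFn):
    def process(self, element):
        # element is a PubsubMessage; element.data holds the raw serialized bytes
        obj: Dict = {
            'data': element.data
        }
        logging.info(f'data bytes: {obj}')
        logging.info(f'data type: {type(obj["data"])}')
        return [obj]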

These are the pipeline lines I use to insert into BQ:

    bytes_status = (status | 'Get Bytes Result' >> beam.ParDo(GetBytes()))
    bytes_status | 'Write to BQ BackUp' >> beam.io.WriteToBigQuery('my_project:my_dataset.my_table')

The logs seem to show the correct data:

2020-09-29 11:16:40.094 CEST data bytes: {'data': b'\n\x04\x08\x01\x10\x02\n\x04\x08\x02\x10\x02\n\x02\x08\x03\n\x04\x08\x04\x10\x02\n\x04\x08\x05\x10\x02\n\x04\x08\x06\x10\x02\n\x02\x08\x07\n\x04\x08\x08\x10\x01\n\x02\x08\t\n\x04\x08\n\x10\x01\n\x04\x08\x0b\x10\x02\n\x02\x08\x0c\n\x04\x08\r\x10\x02\n\x04\x08\x0e\x10\x02\n\x04\x08\x0f\x10\x02\n\x04\x08\x10\x10\x02\n\x04\x08\x11\x10\x01\n\x04\x08\x12\x10\x01\n\x04\x08\x01\x10\x02\n\x02\x08\x02\n\x04\x08\x03\x10\x01\n\x02\x08\x04\n\x04\x08\x05\x10\x02\n\x04\x08\x06\x10\x01\n\x04\x08\x07\x10\x02\n\x02\x08\x08\n\x04\x08\t\x10\x01\n\x04\x08\n\x10\x02\n\x04\x08\x0b\x10\x01\n\x02\x08\x0c\n\x04\x08\r\x10\x02\n\x04\x08\x0e\x10\x02\n\x04\x08\x0f\x10\x02\n\x04\x08\x10\x10\x02\n\x04\x08\x11\x10\x02\n\x04\x08\x12\x10\x02\x10\xb4\x95\x99\xc9\xcd.'}

But I keep getting the following error message:

UnicodeDecodeError: 'utf-8 [while running 'generatedPtransform-297']' codec can't decode byte 0x89 in position 101: invalid start byte

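(The error may not correspond to the log entry above, but it is always this kind of message.)

I tried inserting bytes data from the BigQuery UI and everything worked fine...

Any idea what is going wrong?

Thanks :)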

Answer 1

When writing this way, BigQuery requires BYTES values to be base64-encoded. You can find some documentation and links with more details at https://beam.apache.org/releases/pydoc/2.24.0/apache_beam.io.gcp.bigquery.html#additional-parameters-for-bigquery-tables
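As a rough illustration of the base64 requirement, here is a minimal sketch that uses only the Python standard library. The helper name `to_bq_row` is hypothetical and not part of Beam or BigQuery; it assumes the target column is called 'data', as in the question.

```python
import base64
from typing import Dict


def to_bq_row(payload: bytes) -> Dict[str, str]:
    """Wrap raw bytes in a row dict whose 'data' field is base64 text.

    Hypothetical helper for illustration; the column name 'data' is taken
    from the question's table.
    """
    # b64encode returns bytes, so decode to ASCII to get a plain string
    # that can be serialized without a UTF-8 decoding error.
    return {'data': base64.b64encode(payload).decode('ascii')}


# A payload containing 0x89, which is not a valid UTF-8 start byte.
raw = b'\n\x04\x08\x01\x10\x02\x89'
row = to_bq_row(raw)

# Decoding the stored text gives back the original protobuf bytes.
assert base64.b64decode(row['data']) == raw
```

In the pipeline from the question, the same idea would presumably go inside `GetBytes.process`, for example `return [to_bq_row(element.data)]`, so that the value reaching `WriteToBigQuery` is a base64 string rather than raw bytes.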

BigQuery does not accept binary data from protobuf
