Question

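I am building an application that receives data from Kafka. With the standard avro library provided by Apache (https://pypi.org/project/avro-python3/) the results are correct, but deserialization is very slow.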

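import io

import avro.io
import avro.schema
from avro.io import BinaryDecoder
from kafka import KafkaConsumer


class KafkaReceiver:
    data = {}

    def __init__(self, bootstrap='192.168.1.111:9092'):
        self.client = KafkaConsumer(
            'topic',
            bootstrap_servers=bootstrap,
            client_id='app',
            api_version=(0, 10, 1)
        )
        # Parse the schema once and reuse the same reader for every message.
        self.schema = avro.schema.parse(open("Schema.avsc", "rb").read())
        self.reader = avro.io.DatumReader(self.schema)

    def do(self):
        for msg in self.client:
            # Each Kafka message value holds one serialized Avro record.
            bytes_reader = io.BytesIO(msg.value)
            decoder = BinaryDecoder(bytes_reader)

            self.data = self.reader.read(decoder)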

While reading up on why this is so slow, I found that fastavro is supposed to be much faster. I am using it like this:

    def do(self):

        schema = fastavro.schema.load_schema('Schema.avsc')
        for msg in self.client:
            bytes_reader = io.BytesIO(msg.value)
            bytes_reader.seek(0)
            for record in reader(bytes_reader, schema):
                self.data = record

Since everything works with the Apache library, I expected fastavro to behave the same way. However, when I run this I get:

  File "fastavro/_read.pyx", line 389, in fastavro._read.read_map
  File "fastavro/_read.pyx", line 290, in fastavro._read.read_utf8
  File "fastavro/_six.pyx", line 22, in fastavro._six.py3_btou
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 3: invalid start byte

I don't usually program in Python, so I'm not entirely sure how to deal with this. Any ideas?

Answer 1

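fastavro.reader expects the Avro file format, which includes a header. It looks like what you have is a serialized record without a header. I think you may be able to read it with fastavro.schemaless_reader.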
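If you want to check which case you are in, you can look at the first four bytes of msg.value. Per the Avro specification, an object container file starts with the ASCII bytes "Obj" followed by the byte 1; a bare serialized record has no such prefix. A minimal sketch (the helper name and the sample byte strings are made up for illustration):

```python
# First four bytes of every Avro object container file, per the Avro spec.
AVRO_CONTAINER_MAGIC = b"Obj\x01"


def has_avro_container_header(payload: bytes) -> bool:
    """Return True if the payload starts with the Avro container-file magic."""
    return payload[:4] == AVRO_CONTAINER_MAGIC


# Made-up sample payloads, only to show the check.
container_like = AVRO_CONTAINER_MAGIC + b"\x00rest-of-file"
bare_record = b"\x06foo\x54"

print(has_avro_container_header(container_like))  # True
print(has_avro_container_header(bare_record))     # False
```

If this returns False for your Kafka messages, fastavro.reader is the wrong entry point for them.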

So instead of:

for record in reader(bytes_reader, schema):
    self.data = record

you would do:

self.data = schemaless_reader(bytes_reader, schema)

Avro deserialization from Kafka using fastavro
