How to deserialize a RecordBatch from a pyarrow buffer

Time: 2019-10-17 09:18:38

Tags: python pyarrow

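My goal is to serialize a RecordBatch, send it over a websocket channel, and deserialize it on the receiving side.

On the receiving side, after receiving the packets and rebuilding a pyarrow.lib.Buffer object with pa.py_buffer, I am unable to deserialize it back into a RecordBatch.

Leaving the websocket boilerplate aside, here is a snippet that summarizes what I am trying to do: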

import pyarrow as pa

indicators = [(1, 'A'), (2, 'B')]

id = pa.int16()
name = pa.string()

data = pa.array(indicators, type=pa.struct([('id', id), ('name', name)]))

batch = pa.RecordBatch.from_arrays([data], ['indicators'])

buffer = batch.serialize()

# How to get back a RecordBatch from buffer?
#
# ???

1 answer:

Answer 0: (score: 1)

When you use the serialize method like this, you can use the read_record_batch function, given a known schema:

>>> pa.ipc.read_record_batch(buffer, batch.schema)
<pyarrow.lib.RecordBatch at 0x7ff412257278>

But this means you need to know the schema on the receiving side. To encapsulate the schema in the serialized data, use RecordBatchStreamWriter instead:

>>> sink = pa.BufferOutputStream()
>>> writer = pa.RecordBatchStreamWriter(sink, batch.schema)
>>> writer.write_batch(batch)
>>> writer.close()
>>> buf = sink.getvalue()

See the documentation at https://arrow.apache.org/docs/python/ipc.html