Avro DataFileReader needs a seekable file

Time: 2018-02-23 17:57:00

Tags: python python-3.x avro

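The problem is that stdin does not support the seeking that avro requires, so we read everything into a buffer and then hand that buffer to avro_wrapper. This works in Python 2 but not in Python 3. I tried a few solutions, but none of them worked.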

# stdin doesn't support seek, which avro needs, so this hack reads everything into a buffer
# and hands that buffer to avro_wrapper. It worked in Python 2 but does not work in Python 3.
buf = StringIO()
buf.write(args.input_file.read())
r = DataFileReader(buf, DatumReader())
# The very first record holds the header information: the header names in order,
# along with the munged header names for all the record types.
# E.g. if we have 2 ports, it will hold the header information of
#   1. port1 under the name1 key
#   2. port2 under the name2 key, and so on
headers_record = next(r)['headers']

The above produces the error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf0 in position 17: invalid continuation byte.

Then we tried this:

input_stream = io.TextIOWrapper(args.input_file.buffer, encoding='latin-1')
sio = io.StringIO(input_stream.read())
r = DataFileReader(sio, DatumReader())
headers_record = next(r)['headers']

This produces the error avro.schema.AvroException: Not an Avro data file: Obj doesn't match b'Obj\x01'.

Another attempt:

input_stream = io.TextIOWrapper(args.input_file.buffer, encoding='latin-1')
buf = io.BytesIO(input_stream.read().encode('latin-1'))
r = DataFileReader(buf.read(), DatumReader())
headers_record = next(r)['headers']

This produces the error AttributeError: 'bytes' object has no attribute 'seek'.

1 Answer:

Answer 0: (score: 0)

io.BytesIO() is the correct type for creating a seekable in-memory file object that holds binary data.

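However, in your last attempt you read the bytes data back out of the io.BytesIO() file object and passed that bytes value to DataFileReader instead of the file object itself. A bytes value has no seek() method, hence the AttributeError.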
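The distinction can be seen with the standard library alone. The following is a minimal sketch, not tied to avro; the `data` bytes are made up for illustration and only imitate the start of a binary file.

```python
import io

# Made-up bytes standing in for binary file content (an assumption for this sketch).
data = b"Obj\x01\xf0\x28\x8c"

buf = io.BytesIO(data)

# buf.read() returns a plain bytes value, which has no file methods such as seek().
content = buf.read()
print(type(content).__name__, hasattr(content, "seek"))

# The BytesIO object itself is the seekable file-like object to pass along.
buf.seek(0)
print(buf.seekable(), hasattr(buf, "seek"))
```

Passing `buf` rather than `buf.read()` is what gives the reader an object it can seek on.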

Don't call read() on it. Fill the io.BytesIO object with the binary data read from stdin and pass in the file object itself:

buf = io.BytesIO(args.input_file.buffer.read())
r = DataFileReader(buf, DatumReader())

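I passed in the data from args.input_file.buffer directly, assuming that args.input_file is a TextIOWrapper instance that decodes the stdin bytes, and that .buffer is the underlying BufferedReader instance that provides the raw binary data. There is no point in decoding this data as Latin-1 and then encoding it as Latin-1 again. Just pass the bytes along.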
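As a rough sketch of the same idea, the helper below accepts either a text-mode stream or a binary stream and returns a seekable buffer. The name `to_seekable` is hypothetical (it is not part of avro), and `fake_stdin` with its made-up payload only simulates a text-mode stdin so the snippet can run without piped input.

```python
import io


def to_seekable(stream):
    """Return a seekable BytesIO holding all bytes from `stream`.

    Hypothetical helper for illustration; not part of the avro library.
    """
    # Text streams such as sys.stdin expose the raw byte stream as .buffer;
    # streams without that attribute are assumed to be binary and read directly.
    raw = getattr(stream, "buffer", stream)
    return io.BytesIO(raw.read())


# Simulate a text-mode stdin carrying binary data (made-up bytes).
payload = b"Obj\x01\xf0\x28\x8c"
fake_stdin = io.TextIOWrapper(io.BytesIO(payload), encoding="utf-8")

buf = to_seekable(fake_stdin)
assert buf.seekable()
assert buf.getvalue() == payload

# Latin-1 maps every byte to exactly one code point, so decoding and then
# re-encoding returns the same bytes: the round trip adds nothing.
assert payload.decode("latin-1").encode("latin-1") == payload
```

With avro installed, the result would be used as in the answer's snippet, i.e. DataFileReader(to_seekable(args.input_file), DatumReader()). Note that this reads the entire input into memory, which is the same trade-off the original hack makes.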