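I want to write a large data stream to a Parquet file using Python. The data is too large to hold in memory, so I cannot write it all in one go.
I found two Python libraries that can read and write Parquet files (Pyarrow, Fastparquet). Below is my current solution using Pyarrow, but I would be happy to try another library if you know of a working solution: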

import pandas as pd
import random
import pyarrow as pa
import pyarrow.parquet as pq


def data_generator():
    # This is a simulation for my generator function
    # It is not allowed to change the nature of this function
    options = ['op1', 'op2', 'op3', 'op4']
    while True:
        dd = {'c1': random.randint(1, 10), 'c2': random.choice(options)}
        yield dd


result_file_address = 'example.parquet'
index = 0

try:
    dic_data = next(data_generator())
    df = pd.DataFrame(dic_data, [index])
    table = pa.Table.from_pandas(df)
    with pq.ParquetWriter(result_file_address, table.schema,
                          compression='gzip', use_dictionary=['c1', 'c2']
                          ) as writer:
        writer.write_table(table)
        for dic_data in data_generator():
            index += 1
            df = pd.DataFrame(dic_data, [index])
            table = pa.Table.from_pandas(df)
            writer.write_table(table=table)
except StopIteration:
    pass
finally:
    del data_generator

The code above fails with the following error:

Traceback (most recent call last):
  File "stringsource", line 15, in string.from_py.__pyx_convert_string_from_py_std__in_string
TypeError: expected bytes, str found
Exception ignored in: 'pyarrow._parquet.ParquetWriter._set_dictionary_props'
Traceback (most recent call last):
  File "stringsource", line 15, in string.from_py.__pyx_convert_string_from_py_std__in_string
TypeError: expected bytes, str found

Thank you very much!