Writing a large data stream to Parquet with Python

Date: 2019-05-30 11:54:56

Tags: python bigdata streaming parquet pyarrow

I want to write a large stream of data to a Parquet file with Python. My data is very big; I cannot hold it in memory and write it out in one go.

I found two Python libraries that can read and write Parquet files (pyarrow and fastparquet). Here is my solution using pyarrow, but if you know of a working approach with the other library, I would be happy to try it:

import pandas as pd
import random
import pyarrow as pa
import pyarrow.parquet as pq


def data_generator():
    # This is a simulation for my generator function
    # It is not allowed to change the nature of this function
    options = ['op1', 'op2', 'op3', 'op4']
    while True:
        dd = {'c1': random.randint(1, 10), 'c2': random.choice(options)}
        yield dd


result_file_address = 'example.parquet'
index = 0

try:
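    # Take one record (from a throwaway generator instance) to build an
    # initial one-row DataFrame and derive the Parquet schema from it.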
    dic_data = next(data_generator())
    df = pd.DataFrame(dic_data, [index])
    table = pa.Table.from_pandas(df)
    with pq.ParquetWriter(result_file_address, table.schema,
                          compression='gzip', use_dictionary=['c1', 'c2']
                          ) as writer:
        writer.write_table(table)
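        # Stream the remaining records, writing each one as its own
        # single-row table (one tiny row group per write_table call).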
        for dic_data in data_generator():
            index += 1
            df = pd.DataFrame(dic_data, [index])
            table = pa.Table.from_pandas(df)
            writer.write_table(table=table)
except StopIteration:
    pass
finally:
    del data_generator

The code above has the following problems (a possible workaround is sketched after the traceback below):

  • All of the data is kept in RAM and only written to disk at the end of the process, which is impractical for me given RAM size limits.
  • I can shrink the final result significantly with 7zip, so the built-in compression does not seem to be working properly.
  • I get the following warning when using use_dictionary:
Traceback (most recent call last):
  File "stringsource", line 15, in string.from_py.__pyx_convert_string_from_py_std__in_string
TypeError: expected bytes, str found
Exception ignored in: 'pyarrow._parquet.ParquetWriter._set_dictionary_props'
Traceback (most recent call last):
  File "stringsource", line 15, in string.from_py.__pyx_convert_string_from_py_std__in_string
TypeError: expected bytes, str found
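
For what it's worth, here is a minimal sketch of the direction I am considering, assuming the same data_generator as above: rows are buffered into chunks before each write (the 10,000-row CHUNK_SIZE and the use of itertools.islice are my own assumptions), so each write_table call emits one reasonably sized row group instead of a single-row one, which should also give the gzip codec something to compress. Passing use_dictionary=True instead of a list of str column names seems to avoid the "expected bytes, str found" warning (byte strings such as [b'c1', b'c2'] may also work):

import itertools

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

CHUNK_SIZE = 10_000  # assumed batch size; tune to the available RAM

gen = data_generator()  # a single generator instance, consumed incrementally

# Buffer the first chunk to build a table and derive the schema.
rows = list(itertools.islice(gen, CHUNK_SIZE))
table = pa.Table.from_pandas(pd.DataFrame(rows))

with pq.ParquetWriter('example.parquet', table.schema,
                      compression='gzip',
                      use_dictionary=True  # avoids the TypeError raised by ['c1', 'c2']
                      ) as writer:
    writer.write_table(table)
    while True:
        rows = list(itertools.islice(gen, CHUNK_SIZE))
        if not rows:
            break  # real generator exhausted (the simulated one never ends)
        writer.write_table(pa.Table.from_pandas(pd.DataFrame(rows)))

If fastparquet turns out to be preferable, its write function accepts an append=True flag that might support a similar incremental pattern, though I have not tried it.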

Thanks a lot!

0 Answers:

No answers