Memory leak from pyarrow?

Date: 2018-10-26 22:01:52

Tags: python pandas parquet pyarrow

To parse larger files, I need to write to a large number of parquet files in a loop, one after another. However, the memory consumed by this task seems to grow with every iteration, whereas I would expect it to stay constant (since nothing should be accumulating in memory). This makes it hard to scale.

I've added a minimal reproducible example that creates 10000 parquet files and appends to them in a loop.

import resource
import random
import string
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd


def id_generator(size=6, chars=string.ascii_uppercase + string.digits):
    # Random alphanumeric string, used both for file names and for cell values
    return ''.join(random.choice(chars) for _ in range(size))

schema = pa.schema([
    pa.field('test', pa.string()),
])

# Raise the open-file limit so that thousands of writers can stay open at once
resource.setrlimit(resource.RLIMIT_NOFILE, (1000000, 1000000))
number_files = 10000
number_rows_increment = 1000
number_iterations = 100

# Open one ParquetWriter per file; all writers stay open for the whole run
writers = [pq.ParquetWriter('test_' + id_generator() + '.parquet', schema) for i in range(number_files)]

for i in range(number_iterations):
    for writer in writers:
        # Build a small single-column table and append it to the file as a new row group
        table_to_write = pa.Table.from_pandas(
            pd.DataFrame({'test': [id_generator() for _ in range(number_rows_increment)]}),
            preserve_index=False,
            schema=schema,
            nthreads=1)
        table_to_write = table_to_write.replace_schema_metadata(None)
        writer.write_table(table_to_write)
    print(i)

for writer in writers:
    writer.close()
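
For completeness, one way to watch the growth is to print the process's peak RSS after each outer iteration instead of the bare print(i). A minimal sketch (on Linux ru_maxrss is reported in kilobytes; on macOS it is in bytes):

    # At the end of each outer iteration:
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(i, 'peak RSS (kB on Linux):', rss)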

Would anyone know what is causing this leak and how to prevent it?

1 Answer:

Answer 0 (score: 1)

We're not sure what is wrong, but some other users have reported as-yet-undiagnosed memory leaks. I have added your example to the tracking JIRA issue https://issues.apache.org/jira/browse/ARROW-3324
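
One thing that may help narrow it down (a sketch, not a confirmed diagnosis): compare what Arrow's default memory pool reports as allocated with the process RSS after each iteration. If pa.total_allocated_bytes() stays flat while RSS keeps growing, the growth is likely coming from something other than Arrow buffer allocations (for example, per-row-group metadata that each open writer has to keep until close).

import resource
import pyarrow as pa

def report(i):
    # Bytes currently tracked by Arrow's default memory pool
    arrow_bytes = pa.total_allocated_bytes()
    # Peak resident set size of the process (kB on Linux)
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(i, 'arrow bytes:', arrow_bytes, 'peak RSS:', rss)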