Question

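I have large CSV files that I'd ultimately like to convert to Parquet. Pandas won't help because of memory constraints and its difficulty handling NULL values (which are common in my data). I checked the PyArrow docs and there are tools for reading Parquet files, but I didn't see anything about reading CSVs. Did I miss something, or is this feature somehow incompatible with PyArrow?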

Answer 1

We're working on this feature, and there is a pull request up now: https://github.com/apache/arrow/pull/2576. You can help by testing it out!

Answer 2

You can read the CSV in chunks with pd.read_csv(chunksize=...), then write one chunk at a time with Pyarrow.

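The one caveat is, as you mentioned, that Pandas will give inconsistent dtypes if a column is all nulls within one chunk, so you have to make sure the chunk size is larger than the longest run of nulls in your data.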

This reads CSV from stdin and writes Parquet to stdout (Python 3).

#!/usr/bin/env python
import sys

import pandas as pd
import pyarrow.parquet

# This has to be big enough you don't get a chunk of all nulls: https://issues.apache.org/jira/browse/ARROW-2659
SPLIT_ROWS = 2 ** 16

def main():
    writer = None
    for split in pd.read_csv(sys.stdin.buffer, chunksize=SPLIT_ROWS):
        table = pyarrow.Table.from_pandas(split, preserve_index=False)
        # Timestamps have issues if you don't convert to ms. https://github.com/dask/fastparquet/issues/82
        writer = writer or pyarrow.parquet.ParquetWriter(sys.stdout.buffer, table.schema, coerce_timestamps='ms', compression='gzip')
        writer.write_table(table)
    # With empty input no chunk is read and no writer is ever created.
    if writer is not None:
        writer.close()

if __name__ == "__main__":
    main()

Reading CSV with PyArrow
