Question

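I have some code that writes pyarrow Parquet datasets. I want an integration test that verifies the files are written correctly, and I would like to do that by writing a small sample chunk of data to an in-memory filesystem. However, I am struggling to find a pyarrow-compatible in-memory filesystem interface for Python.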

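Below is a snippet that includes a filesystem variable. I want to replace that filesystem variable with an in-memory filesystem that I can later inspect programmatically in my integration tests.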

import pyarrow.parquet as pq

pq.write_to_dataset(
    score_table,
    root_path=AWS_ZEBRA_OUTPUT_S3_PREFIX,
    filesystem=filesystem,
    partition_cols=[
        EQF_SNAPSHOT_YEAR_PARTITION,
        EQF_SNAPSHOT_MONTH_PARTITION,
        EQF_SNAPSHOT_DAY_PARTITION,
        ZEBRA_COMPUTATION_TIMESTAMP,
    ],
)

Answer 1

You can pass an in-memory file object to write_to_dataset if filesystem is None.

So your call could become:

from io import BytesIO
import pyarrow.parquet as pq

with BytesIO() as f:
    pq.write_to_dataset(
        score_table,
        root_path=f,
        filesystem=None,
        partition_cols=[
            EQF_SNAPSHOT_YEAR_PARTITION,
            EQF_SNAPSHOT_MONTH_PARTITION,
            EQF_SNAPSHOT_DAY_PARTITION,
            ZEBRA_COMPUTATION_TIMESTAMP
        ]
    )

The relevant lines from the pyarrow source:

def resolve_filesystem_and_path(where, filesystem=None):
    """
    Return filesystem from path which could be an HDFS URI, a local URI,
    or a plain filesystem path.
    """
    if not _is_path_like(where):
        if filesystem is not None:
            raise ValueError("filesystem passed but where is file-like, so"
                             " there is nothing to open with filesystem.")
        return filesystem, where

https://github.com/apache/arrow/blob/207b3507be82e92ebf29ec7d6d3b0bb86091c09a/python/pyarrow/filesystem.py#L402-L411

Answer 2

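In the end, I manually implemented an instance of the pyarrow.FileSystem ABC. Testing with mock does not seem to work, because pyarrow (in a not very Pythonic way) checks the type of the filesystem argument passed to write_to_dataset. I would suggest changing the logic in that method so that it does not explicitly check the type, which would make testing easier.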

In-memory filesystem for pyarrow tests
