Question

Unable to load Parquet files that have the same column names but in a different order.

Scenario:

ABD-MacBook-Pro:ttt abd$ tree
.
├── testing1.paquet
└── testing2.paquet

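I have the two Parquet files shown above. The column names are the same in both files, but the column order differs. I am able to load these files using Spark. Am I missing something here, or does pyarrow not support this?

I am trying to load those Parquet files with the following command.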

import pyarrow.parquet as pq
pandas_df = pq.ParquetDataset('ttt', filesystem=file_system).read_pandas().to_pandas()

Running the above command produces the following error.

ValueError: Schema in ttt//testing2.paquet was different.

C1: string
C2: string
C3: string
C4: string
Unnamed: 4: double
Unnamed: 5: double
Unnamed: 6: double
__index_level_0__: int64
metadata
--------
{b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
            b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
            b'type": "object", "metadata": {"encoding": "UTF-8"}}], "columns":'
            b' [{"name": "C1", "field_name": "C1", "pandas_type": "unicode", "'
            b'numpy_type": "object", "metadata": null}, {"name": "C2", "field_'
            b'name": "C2", "pandas_type": "unicode", "numpy_type": "object", "'
            b'metadata": null}, {"name": "C3", "field_name": "C3", "pandas_typ'
            b'e": "unicode", "numpy_type": "object", "metadata": null}, {"name'
            b'": "C4", "field_name": "C4", "pandas_type": "unicode", "numpy_ty'
            b'pe": "object", "metadata": null}, {"name": "Unnamed: 4", "field_'
            b'name": "Unnamed: 4", "pandas_type": "float64", "numpy_type": "fl'
            b'oat64", "metadata": null}, {"name": "Unnamed: 5", "field_name": '
            b'"Unnamed: 5", "pandas_type": "float64", "numpy_type": "float64",'
            b' "metadata": null}, {"name": "Unnamed: 6", "field_name": "Unname'
            b'd: 6", "pandas_type": "float64", "numpy_type": "float64", "metad'
            b'ata": null}, {"name": null, "field_name": "__index_level_0__", "'
            b'pandas_type": "int64", "numpy_type": "int64", "metadata": null}]'
            b', "pandas_version": "0.23.0"}'}

vs

C1: string
C2: string
C4: string
C3: string
Unnamed: 4: double
Unnamed: 5: double
Unnamed: 6: double
__index_level_0__: int64
metadata
--------
{b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
            b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
            b'type": "object", "metadata": {"encoding": "UTF-8"}}], "columns":'
            b' [{"name": "C1", "field_name": "C1", "pandas_type": "unicode", "'
            b'numpy_type": "object", "metadata": null}, {"name": "C2", "field_'
            b'name": "C2", "pandas_type": "unicode", "numpy_type": "object", "'
            b'metadata": null}, {"name": "C4", "field_name": "C4", "pandas_typ'
            b'e": "unicode", "numpy_type": "object", "metadata": null}, {"name'
            b'": "C3", "field_name": "C3", "pandas_type": "unicode", "numpy_ty'
            b'pe": "object", "metadata": null}, {"name": "Unnamed: 4", "field_'
            b'name": "Unnamed: 4", "pandas_type": "float64", "numpy_type": "fl'
            b'oat64", "metadata": null}, {"name": "Unnamed: 5", "field_name": '
            b'"Unnamed: 5", "pandas_type": "float64", "numpy_type": "float64",'
            b' "metadata": null}, {"name": "Unnamed: 6", "field_name": "Unname'
            b'd: 6", "pandas_type": "float64", "numpy_type": "float64", "metad'
            b'ata": null}, {"name": null, "field_name": "__index_level_0__", "'
            b'pandas_type": "int64", "numpy_type": "int64", "metadata": null}]'
            b', "pandas_version": "0.23.0"}'}

Answer 1

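pyarrow does not currently support this. More specifically, the current limitation is that the schemas of all the different pieces/files must be identical (not only the column order, but also the types).

There are definitely plans to improve this and to do some schema normalization when reading Parquet files (for differing types, see for example https://issues.apache.org/jira/browse/ARROW-2659). For this specific problem, the JIRA issue https://issues.apache.org/jira/browse/ARROW-2366 covers the case.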
