Question

I followed the question "pyarrow data types for columns that have lists of dictionaries?" and created an Arrow table that contains a MapType column.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

print(f'PyArrow Version = {pa.__version__}')
print(f'Pandas Version = {pd.__version__}')

df = pd.DataFrame({
        'col1': pd.Series([
            [('id', 'something'), ('value2', 'else')],
            [('id', 'something2'), ('value','else2')],
        ]),
        'col2': pd.Series(['foo', 'bar'])
    }
)

udt = pa.map_(pa.string(), pa.string())
schema = pa.schema([pa.field('col1', udt), pa.field('col2', pa.string())])
table = pa.Table.from_pandas(df, schema)
pq.write_table(table, './test_map.parquet')

The code above runs without problems on the machine I am developing on:

PyArrow Version = 1.0.1
Pandas Version = 1.1.2

and it successfully generates the test_map.parquet file.

Then I read the file with parquet-tools (1.11.1), but got the following output:

col1:
.key_value:
.key_value:
col2 = foo

col1:
.key_value:
.key_value:
col2 = bar

The keys and values are missing... Can you help me?

Answer 1

We filed a JIRA issue with Apache Arrow on September 30, 2020: https://issues.apache.org/jira/browse/ARROW-10140

The issue was resolved in PyArrow 2.0.0, released on October 20, 2020.

So if you run into the same problem when using map types, upgrade PyArrow to 2.0.0 (or later).
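If a script depends on the fix, it may help to check the installed version before writing map columns. The following is a minimal sketch, not part of the original answer: `parse_version` and `map_write_is_fixed` are hypothetical helper names, and the only fact taken from the answer is the 2.0.0 threshold. Versions are compared as integer tuples because a plain string comparison would rank "10.0.1" below "2.0.0".

```python
import re

# First release containing the ARROW-10140 fix, according to the answer above.
MIN_VERSION = (2, 0, 0)


def parse_version(text):
    """Return the leading numeric part of a version string as (major, minor, patch)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    # A missing patch component (e.g. "3.0") is treated as 0.
    return tuple(int(group or 0) for group in match.groups())


def map_write_is_fixed(version_text):
    """True if the given PyArrow version is 2.0.0 or later."""
    return parse_version(version_text) >= MIN_VERSION
```

In the question's script this would be called as `map_write_is_fixed(pa.__version__)`, which returns False for the 1.0.1 installation shown there.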

Answer 2

I tried to reproduce this, but got the following error:

pyarrow.lib.ArrowNotImplementedError: Reading lists of structs from Parquet files not yet supported: key_value: list<key_value: struct<key: string not null, value: string> not null> not null

Neither lists of structs nor maps are well supported when reading from Parquet.

For data like this, I would suggest using a simpler schema:

df = pd.DataFrame({
        'col1': pd.Series([
            {'id': 'something', 'value':'else'},
            {'id': 'somethings', 'value':'elses'},
        ]),
        'col2': pd.Series(['foo', 'bar'])
    }
)
 
udt = pa.struct([pa.field('id', pa.string()), pa.field('value', pa.string())])
schema = pa.schema([pa.field('col1', udt), pa.field('col2', pa.string())])

table = pa.Table.from_pandas(df, schema)

Output:

+----------------------------------------+--------+
| col1                                   | col2   |
|----------------------------------------+--------|
| {'id': 'something', 'value': 'else'}   | foo    |
| {'id': 'somethings', 'value': 'elses'} | bar    |
+----------------------------------------+--------+
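One consequence of the struct approach is that the set of fields is fixed by the schema, whereas the map data in the question used different keys per row ('value2' in the first row, 'value' in the second). A minimal sketch of converting the question's key/value pairs into struct-shaped dicts follows. `pairs_to_struct_row` and `FIELD_NAMES` are hypothetical names introduced here, and silently dropping undeclared keys is an assumption that may not suit every dataset.

```python
# Field names must match those declared in pa.struct([...]) in the answer above.
FIELD_NAMES = ["id", "value"]


def pairs_to_struct_row(pairs, field_names):
    """Turn [(key, value), ...] into a dict holding exactly the given field names.

    Fields missing from the pairs become None; keys that are not in
    field_names are dropped (an assumption made for this sketch).
    """
    lookup = dict(pairs)
    return {name: lookup.get(name) for name in field_names}


# The map-style rows from the question.
rows = [
    [("id", "something"), ("value2", "else")],
    [("id", "something2"), ("value", "else2")],
]

struct_rows = [pairs_to_struct_row(row, FIELD_NAMES) for row in rows]
```

The resulting `struct_rows` list can be used as the 'col1' data in place of the hand-written dicts. Note that the first row loses its 'value2' entry, so data with truly variable keys still needs the map type and therefore the upgraded PyArrow.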

No data for map column in a Parquet file created from pyarrow and pandas
