Question

I want to store the following pandas DataFrame in a Parquet file using PyArrow:

import pandas as pd
df = pd.DataFrame({'field': [[{}, {}]]})

The type of the field column is a list of dicts:

      field
0  [{}, {}]

I first define the corresponding PyArrow schema:

import pyarrow as pa
schema = pa.schema([pa.field('field', pa.list_(pa.struct([])))])

Then I call from_pandas():

table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)

This raises the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "table.pxi", line 930, in pyarrow.lib.Table.from_pandas
  File "/anaconda3/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 371, in dataframe_to_arrays
    convert_types)]
  File "/anaconda3/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 370, in <listcomp>
    for c, t in zip(columns_to_convert,
  File "/anaconda3/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 366, in convert_column
    return pa.array(col, from_pandas=True, type=ty)
  File "array.pxi", line 177, in pyarrow.lib.array
  File "error.pxi", line 77, in pyarrow.lib.check_status
  File "error.pxi", line 87, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Unknown list item type: struct<>

Am I doing something wrong, or is this not supported by PyArrow?

I am using pyarrow 0.9.0, pandas 0.23.4, and Python 3.6.

Answer 1

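As of now, using a mix of lists and structs as a column data type is a feature that is not yet implemented in Apache Arrow (the library underlying PyArrow). This Jira issue tracks progress on the topic.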

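With PyArrow 0.15.0 it is already possible to create a pyarrow Table from a pandas DataFrame with nested types, but the table cannot be saved to a Parquet file (or converted back to a pandas DataFrame):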

import pandas as pd
import pyarrow as pa
import pyarrow.parquet

df = pd.DataFrame({'field': [[{'a': 1}, {'a': 2}]]})
schema = pa.schema(
    [pa.field('field', pa.list_(pa.struct([('a', pa.int64())])))])
table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
pyarrow.parquet.write_table(table, 'test.parquet')

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/tmp/bla.py", line 11, in <module>
    pyarrow.parquet.write_table(table, 'test.parquet')
  File "/anaconda3/lib/python3.7/site-packages/pyarrow/parquet.py", line 1344, in write_table
    writer.write_table(table, row_group_size=row_group_size)
  File "/anaconda3/lib/python3.7/site-packages/pyarrow/parquet.py", line 474, in write_table
    self.writer.write_table(table, row_group_size=row_group_size)
  File "pyarrow/_parquet.pyx", line 1375, in pyarrow._parquet.ParquetWriter.write_table
  File "pyarrow/error.pxi", line 86, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Level generation for Struct not supported yet

Answer 2

Here is a snippet that reproduces this error:

#!/usr/bin/env python3
import pandas as pd  # type: ignore


def main():
    """Main function"""
    df = pd.DataFrame()
    df["nested"] = [[dict()] for i in range(10)]

    df.to_feather("test.feather")
    print("Success once")
    df = pd.read_feather("test.feather")
    df.to_feather("test.feather")


if __name__ == "__main__":
    main()

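Note that the first write from pandas to Feather succeeds, but once the DataFrame has been loaded from Feather, trying to write it back fails.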

To fix this, just update to pyarrow 2.0.0:

pip3 install pyarrow==2.0.0

Available pyarrow versions as of 2020-11-16:

0.9.0, 0.10.0, 0.11.0, 0.11.1, 0.12.0, 0.12.1, 0.13.0, 0.14.0, 0.15.1, 0.16.0, 0.17.0, 0.17.1, 1.0.0, 1.0.1, 2.0.0
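
To confirm that an environment actually picked up the upgrade, something like the following sketch could be used. The helper names parse_version and pyarrow_is_new_enough are hypothetical, the 2.0.0 threshold is simply the version mentioned above, and the parsing only handles plain dotted version strings (it needs Python 3.8+ for importlib.metadata).

```python
import re
from importlib import metadata

MIN_PYARROW = (2, 0, 0)  # threshold taken from the answer above


def parse_version(text):
    """Turn '0.15.1' into (0, 15, 1); trailing tags such as 'rc1' are ignored."""
    parts = []
    for piece in text.split('.')[:3]:
        match = re.match(r'\d+', piece)
        if match is None:
            break
        parts.append(int(match.group()))
    # Pad so that '2.0' compares equal to (2, 0, 0).
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def pyarrow_is_new_enough(minimum=MIN_PYARROW):
    """Return True if pyarrow is installed and at least `minimum`."""
    try:
        installed = metadata.version('pyarrow')
    except metadata.PackageNotFoundError:
        return False
    return parse_version(installed) >= minimum


print(pyarrow_is_new_enough())
```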

Answer 3

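I have been able to save pandas DataFrames that have arrays in their columns as Parquet, and read them back from Parquet into a DataFrame, by converting the DataFrame's object dtypes to str.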

import awswrangler as wr

# df, timezone, path and table are assumed to be defined earlier.

def mapTypes(x):
    # Map a pandas dtype name to the dtype used for writing; unmapped types default to 'str'.
    tz_dtype = 'datetime64[ns, ' + timezone + ']'
    return {'object': 'str', 'int64': 'int64', 'float64': 'float64', 'bool': 'bool',
            tz_dtype: tz_dtype}.get(x, 'str')

table_names = list(df.columns)
table_types = [mapTypes(x.name) for x in df.dtypes]
parquet_table = dict(zip(table_names, table_types))
df_pq = df.astype(parquet_table)

wr.s3.to_parquet(df=df_pq, path=path, dataset=True, database='test', mode='overwrite',
                 table=table.lower(), partition_cols=['realmid'], sanitize_columns=True)

The image below shows a Parquet file stored in S3 being read into a DataFrame using the AWS Data Wrangler library; I have also done this with pyarrow.
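
One caveat with casting object columns to str is that each cell is stored as the Python string form of the list, so reading it back yields strings rather than lists of dicts. A minimal sketch of a variant that keeps the values recoverable, assuming JSON encoding is acceptable for your data (encode_nested and decode_nested are hypothetical helper names, not part of pandas, pyarrow or awswrangler):

```python
import json


def encode_nested(cell):
    """Serialize a list or dict cell to a JSON string; leave other values as-is."""
    if isinstance(cell, (list, dict)):
        return json.dumps(cell, sort_keys=True)
    return cell


def decode_nested(cell):
    """Parse cells that look like JSON arrays or objects back into Python objects."""
    if isinstance(cell, str) and cell[:1] in ('[', '{'):
        try:
            return json.loads(cell)
        except json.JSONDecodeError:
            return cell
    return cell


# Plain dict rows stand in for DataFrame rows to keep the sketch dependency-free.
rows = [{'field': [{'a': 1}, {'a': 2}], 'realmid': 7}]
encoded = [{k: encode_nested(v) for k, v in row.items()} for row in rows]
decoded = [{k: decode_nested(v) for k, v in row.items()} for row in encoded]
print(encoded)
print(decoded)
```

With pandas, the same helpers could be applied per column, for example df['field'].map(encode_nested) before writing and .map(decode_nested) after reading.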

PyArrow: Store list of dicts in Parquet using nested types
