Question

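I'm using PyArrow to write Parquet files from some Pandas dataframes in Python.

Is there a way to specify the logical types that are written to the Parquet file?

For example, writing an np.uint32 column with PyArrow results in an INT64 column in the Parquet file, whereas writing the same column with the fastparquet module results in an INT32 column with a logical type of UINT_32 (this is the behaviour I'm after from PyArrow).

E.g.: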

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import fastparquet as fp
import numpy as np

df = pd.DataFrame.from_records(data=[(1, 'foo'), (2, 'bar')], columns=['id', 'name'])
df['id'] = df['id'].astype(np.uint32)

# write parquet file using PyArrow
pq.write_table(pa.Table.from_pandas(df, preserve_index=False), 'pyarrow.parquet')

# write parquet file using fastparquet
fp.write('fastparquet.parquet', df)

# print schemas of both written files
print('PyArrow:', pq.ParquetFile('pyarrow.parquet').schema)
print('fastparquet:', pq.ParquetFile('fastparquet.parquet').schema)

This outputs:

PyArrow: <pyarrow._parquet.ParquetSchema object at 0x10ecf9048>
id: INT64
name: BYTE_ARRAY UTF8

fastparquet: <pyarrow._parquet.ParquetSchema object at 0x10f322848>
id: INT32 UINT_32
name: BYTE_ARRAY UTF8

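I'm seeing similar issues with other column types, so I'm really looking for a generic way to specify the logical types that are used when writing with PyArrow.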

Answer 1

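By default, PyArrow writes Parquet format version 1.0 files, and the UINT_32 logical type requires version 2.0.

The solution is to specify the version when writing the table, i.e.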

pq.write_table(pa.Table.from_pandas(df, preserve_index=False), 'pyarrow.parquet', version='2.0')

This results in the expected Parquet schema being written.
