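I want to write data to Parquet files for use in AWS Athena, where some columns are arrays of strings or arrays of structs (typically key-value pairs).
After finding two Python libraries that support writing Parquet files (Arrow and fastparquet), I have been struggling to get arrays of structs to work.
The top answer to the question on writing Parquet files lists these two libraries (and does mention the lack of support for nested data).
So is there any way to write nested data to Parquet files from Python?
I tried the following with Arrow to store the key/value pairs.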
import pyarrow as pa
import pyarrow.parquet as pq
countries = []
populations = []
countries.append('Sweden')
populations.append([{'city': 'Stockholm', 'population': 1515017}, {'city': 'Gothenburg', 'population': 590580}])
countries.append('Norway')
populations.append([{'city': 'Oslo', 'population': 958378}, {'city': 'Bergen', 'population': 254235}])
ty = pa.struct([pa.field('city', pa.string()),
                pa.field('population', pa.int32())
                ])
fields = [
    pa.field('country', pa.string()),
    pa.field('populations', pa.list_(ty)),
]
sch1 = pa.schema(fields)
data = [
    pa.array(countries),
    pa.array(populations, type=pa.list_(ty))
]
batch = pa.RecordBatch.from_arrays(data, ['country', 'populations'])
table = pa.Table.from_batches([batch], sch1)
writer = pq.ParquetWriter('cities.parquet', sch1)
writer.write_table(table)
writer.close()
When I run the code, I get the following message:
Traceback (most recent call last):
  File "stackoverflow.py", line 30, in <module>
    writer.write_table(table)
  File "/Users/moonhouse/anaconda2/envs/parquet/lib/python3.6/site-packages/pyarrow/parquet.py", line 327, in write_table
    self.writer.write_table(table, row_group_size=row_group_size)
  File "_parquet.pyx", line 955, in pyarrow._parquet.ParquetWriter.write_table
  File "error.pxi", line 77, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Nested column branch had multiple children
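An answer in a recent Arrow JIRA ticket with the same error message suggests that work on struct support is in progress, although it is not clear to me whether that covers writing these structs or only reading them.
When I try to store the data with fastparquet (as if I had a list of strings):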
import pandas as pd
from fastparquet import write
data = [{ 'cities': ['Stockholm', 'Copenhagen', 'Oslo', 'Helsinki']}]
df = pd.DataFrame(data)
write('test.parq', df, compression='SNAPPY')
No error is raised, but when inspecting the file in parquet-tools I noticed that the data is Base64-encoded JSON.
cities = WyJTdG9ja2hvbG0iLCAiQ29wZW5oYWdlbiIsICJPc2xvIiwgIkhlbHNpbmtpIl0=
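To double-check what that value contains, it can be decoded with the Python standard library. This is only a minimal sketch: it assumes the string is copied verbatim from the output shown above, and the variable names are made up for illustration.

```python
import base64
import json

# Value copied from the inspection output shown above (assumed verbatim).
encoded = "WyJTdG9ja2hvbG0iLCAiQ29wZW5oYWdlbiIsICJPc2xvIiwgIkhlbHNpbmtpIl0="

# Undo the Base64 layer first; the result is UTF-8 text.
decoded_text = base64.b64decode(encoded).decode("utf-8")

# The text parses as a JSON array, i.e. the original Python list of strings.
cities = json.loads(decoded_text)

print(decoded_text)
print(cities)
```

So the list is preserved, but only as a serialized JSON string inside a single binary value, not as a native Parquet list column.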