My goal is to write a JSON file to Parquet. To get to a table, I use:
import pandas as pd
import pyarrow as pa
# for the workaround
from pandas.api.types import is_datetime64_any_dtype as is_dt
from collections import OrderedDict
# Json schema in pandas form
schema = {'col1': 'object',
#
#
'col17': 'datetime64[ms]',
#
#
}
# Load json into pandas dataframe
json_df = pd.read_json('/path/to/json_file', dtype = schema, lines = True, date_unit = 'ms')
# Convert to table
table = pa.Table.from_pandas(json_df)
# This doesn't work
converted_col = pa.column(pa.array(json_df['col17']).cast('timestamp[ms]'))
table.set_column(16, converted_col) # still timestamp[ns]
I ended up unpacking the DataFrame into arrays and then rebuilding the table in pyarrow:
# Workaround
# Cast datetime columns to ms; convert every other column as-is
cols = OrderedDict([(col_name, pa.array(json_df[col_name]).cast('timestamp[ms]') if is_dt(json_df[col_name]) else pa.array(json_df[col_name])) for col_name in json_df])
table = pa.Table.from_arrays(list(cols.values()), names=list(cols.keys())) # datetime column is in fact in ms this way
I haven't timed the performance difference, but I suspect the first approach would be faster. What am I doing wrong in the first approach?