Converting a timestamp column in a pyarrow Table to ms using set_column

Asked: 2018-09-15 20:48:18

Tags: pandas pyarrow

My goal is to write a JSON file to Parquet. To get the data into a table, I use:

import pandas as pd
import pyarrow as pa

# for the workaround
from pandas.api.types import is_datetime64_any_dtype as is_dt
from collections import OrderedDict

# Json schema in pandas form
schema = {'col1': 'object', 
          #
          #
          'col17': 'datetime64[ms]',
          #
          # 
          }

# Load json into pandas dataframe
json_df = pd.read_json('/path/to/json_file', dtype = schema, lines = True, date_unit = 'ms')

# Convert to table
table = pa.Table.from_pandas(json_df)

# This doesn't work
converted_col = pa.column(pa.array(json_df['col17']).cast('timestamp[ms]'))
table.set_column(16, converted_col)  # still timestamp[ns] 

I ended up unpacking the DataFrame into arrays and then rebuilding the table in pyarrow:

# Workaround: cast datetime columns to ms, convert every other column as-is
cols = OrderedDict(
    (name, pa.array(json_df[name]).cast('timestamp[ms]') if is_dt(json_df[name]) else pa.array(json_df[name]))
    for name in json_df)
table = pa.Table.from_arrays(list(cols.values()), list(cols.keys()))  # datetime column is in fact in ms this way

I haven't timed the performance difference, but I suspect the first approach would be faster. What am I doing wrong in the first approach?

0 answers:

No answers yet