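I need to read nullable date values stored in integer format ('YYYYMMDD') into pandas, then save the pandas DataFrame to Parquet in Date32[Day] format so that the Athena Glue Crawler classifier recognizes the column as a date. The code below does not let me save the column to Parquet from pandas: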
import pandas as pd
dates = [None, "20200710", "20200711", "20200712"]
data_df = pd.DataFrame(dates, columns=['date'])
data_df['date'] = pd.to_datetime(data_df['date']).dt.date
data_df.to_parquet(r'my_path', engine='pyarrow')
I get the following error:
Traceback (most recent call last):
File "", line 123, in convert_column
result = pa.array(col, type=type_, from_pandas=True, safe=safe)
File "pyarrow\array.pxi", line 265, in pyarrow.lib.array
File "pyarrow\array.pxi", line 80, in pyarrow.lib._ndarray_to_array
TypeError: an integer is required (got type datetime.date)
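If I move the None value to the end of the dates list, this works without any problem and pyarrow infers the date column as Date32[Day]. My guess is that, because the pandas column type produced by dt.date is object and the first value in the column is NaT (Not a Time), pyarrow cannot infer Date32[Day] from the pandas DataFrame or from a sample value, and it infers the column as Integer instead. What is a good way to save this DataFrame column to Parquet as a Date32[Day] column without sorting the column values? Thanks.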
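The guess about the column type can be checked directly in pandas. The snippet below is only a sketch: it passes an explicit format='%Y%m%d' to pd.to_datetime (an addition that is not in the code above) and the variable names are illustrative.

```python
import datetime

import pandas as pd

dates = [None, "20200710", "20200711", "20200712"]

# Parse the YYYYMMDD strings; the missing value becomes NaT
parsed = pd.to_datetime(pd.Series(dates), format="%Y%m%d")

# .dt.date yields Python date objects, so the column dtype falls back to object
as_dates = parsed.dt.date

print(as_dates.dtype)                                  # dtype of the converted column
print(pd.isna(as_dates.iloc[0]))                       # is the first value missing?
print(isinstance(as_dates.iloc[1], datetime.date))     # are the rest plain dates?
```

An object column whose first element is a missing value gives a type-inferring writer nothing date-like to look at until the second row, which fits the behaviour described above.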
Answer 0 (score: 1)
You are right. Since the first value is NaT, you need to remove it without changing the data type. I used the following code.
import pandas as pd
dates = [None, "20200710", "20200711", "20200712"]
data_df = pd.DataFrame(dates, columns=['date'])
data_df['date'] = pd.to_datetime(data_df['date']).dt.date
# Additionally, replace NaT with '' and format the remaining dates as strings
# Adjust the strftime format as needed (year-month-day is used here)
data_df['date'] = [d.strftime('%Y-%m-%d') if not pd.isnull(d) else '' for d in data_df['date']]
data_df.to_parquet(r'my_path', engine='pyarrow')
I hope this works for you and resolves the error.
Answer 1 (score: 0)
This was a bug that has been fixed in pyarrow 1.0 (https://issues.apache.org/jira/browse/ARROW-842 / https://github.com/apache/arrow/pull/7537). The snippet above now works as expected:
In [2]: dates = [None, "20200710", "20200711", "20200712"]
...: data_df = pd.DataFrame(dates, columns=['date'])
...: data_df['date'] = pd.to_datetime(data_df['date']).dt.date
In [3]: data_df
Out[3]:
date
0 NaT
1 2020-07-10
2 2020-07-11
3 2020-07-12
In [4]: data_df.to_parquet(r'my_path', engine='pyarrow')
In [5]: import pyarrow.parquet as pq
In [6]: pq.read_table(r'my_path')
Out[6]:
pyarrow.Table
date: date32[day]