Saving a date column with NaT (null) from pandas to Parquet

Asked: 2020-07-13 21:41:11

Tags: python-3.x pandas parquet amazon-athena pyarrow

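I need to read nullable date values in integer format ('YYYYMMDD') into pandas, and then save the resulting pandas DataFrame to Parquet in Date32[Day] format so that the Athena Glue Crawler classifier recognizes the column as a date. The code below does not let me save the column to Parquet from pandas: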

import pandas as pd

dates = [None, "20200710", "20200711", "20200712"]
data_df = pd.DataFrame(dates, columns=['date'])
data_df['date'] = pd.to_datetime(data_df['date']).dt.date
data_df.to_parquet(r'my_path', engine='pyarrow')

I get the following error:

Traceback (most recent call last):
  File "", line 123, in convert_column
    result = pa.array(col, type=type_, from_pandas=True, safe=safe)
  File "pyarrow\array.pxi", line 265, in pyarrow.lib.array
  File "pyarrow\array.pxi", line 80, in pyarrow.lib._ndarray_to_array
TypeError: an integer is required (got type datetime.date)

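If I move the None value to the end of the dates list, this works without any problem and pyarrow infers the date column as Date32[Day]. My guess is that, because the pandas dtype of the dt.date column is object and the first value of the column is NaT (Not a Time), pyarrow cannot infer Date32[Day] from the pandas DataFrame or from a sample value, and it infers the column as an integer type instead. What is a good way to save this DataFrame column to Parquet as a Date32[Day] column without sorting the column values? Thanks.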
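To make the guess above concrete, here is a quick check, written as a sketch that assumes a reasonably recent pandas. It shows that `.dt.date` yields an `object` column whose first element is a null (`NaT`), so the dtype alone carries no date information for pyarrow to use. The explicit `format="%Y%m%d"` is an addition for this example and is not part of the original snippet.

```python
import datetime

import pandas as pd

dates = [None, "20200710", "20200711", "20200712"]
data_df = pd.DataFrame(dates, columns=["date"])
# Assumption for this sketch: pass the format explicitly rather than rely on inference.
data_df["date"] = pd.to_datetime(data_df["date"], format="%Y%m%d").dt.date

col_dtype = data_df["date"].dtype                   # generic object dtype, not a date dtype
first_is_null = pd.isnull(data_df["date"].iloc[0])  # the leading value is a null (NaT)
second_value = data_df["date"].iloc[1]              # a plain datetime.date object

print(col_dtype, first_is_null, type(second_value))
```

Because the column is `object`, any Arrow type has to be inferred from the values themselves, which is where a leading null can matter.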

2 Answers:

Answer 0 (score: 1)

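You are right. Because the first value is NaT, it has to be replaced before writing. I used the following code, which formats the dates as strings and writes an empty string in place of NaT.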

import pandas as pd

dates = [None, "20200710", "20200711", "20200712"]
data_df = pd.DataFrame(dates, columns=['date'])
data_df['date'] = pd.to_datetime(data_df['date']).dt.date

# In addition, add this line to replace NaT with an empty string
# Change the strftime format as needed (here: year-month-day)
data_df['date'] = [d.strftime('%Y-%m-%d') if not pd.isnull(d) else '' for d in data_df['date']]

data_df.to_parquet(r'my_path', engine='pyarrow')

I hope this works for you and resolves the error.

Answer 1 (score: 0)

This was a bug that has been fixed in pyarrow 1.0 (https://issues.apache.org/jira/browse/ARROW-842 / https://github.com/apache/arrow/pull/7537). The snippet above now works as expected:

In [2]: dates = [None, "20200710", "20200711", "20200712"]
   ...: data_df = pd.DataFrame(dates, columns=['date'])
   ...: data_df['date'] = pd.to_datetime(data_df['date']).dt.date

In [3]: data_df
Out[3]:
         date
0         NaT
1  2020-07-10
2  2020-07-11
3  2020-07-12

In [4]: data_df.to_parquet(r'my_path', engine='pyarrow')

In [5]: import pyarrow.parquet as pq

In [6]: pq.read_table(r'my_path')
Out[6]:
pyarrow.Table
date: date32[day]
Out[6]: 
pyarrow.Table
date: date32[day]