Vectorizing a datetime comparison in pandas

Time: 2021-02-01 17:55:20

Tags: python pandas datetime vectorization

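I recently read a great article (https://towardsdatascience.com/apply-function-to-pandas-dataframe-rows-76df74165ee4) showing that vectorization is much faster than iteration, and I wanted to put it into practice. My current code takes about 16 hours to run on roughly 2 million rows. Here is a sample of the data, held in a pandas DataFrame object named "data":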

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2274589 entries, 0 to 2274588
Data columns (total 5 columns):
 #   Column    Dtype         
---  ------    -----         
 0   Date      object        
 1   Time      object        
 2   Open      float64       
 3   Close     float64       
 4   datetime  datetime64[ns]
print(data)
               Date   Time      Open     Close            datetime
0        02/10/2012  07:26  191.9500  191.9500 2012-02-10 07:26:00
1        02/10/2012  07:56  191.6600  191.6600 2012-02-10 07:56:00
2        02/10/2012  08:00  191.9400  191.9400 2012-02-10 08:00:00
3        02/10/2012  09:30  191.7500  191.7500 2012-02-10 09:30:00
4        02/10/2012  09:54  191.8500  191.8500 2012-02-10 09:54:00

The working code drops rows with times before 9:30 AM or after 3:59 PM:

from datetime import datetime

keep = []
end = data.shape[0]
for row in data.itertuples(index=True):
    # Compare against 09:30 on the same calendar day as this row
    if row.datetime < datetime(year=row.datetime.year, month=row.datetime.month, day=row.datetime.day, hour=9, minute=30, second=0):
        pass
    elif row.datetime.hour >= 16:  # closes at 15:59 (keep it!) in this database's notation
        pass
    else:
        keep.append(row[0])  # row[0] is the index label
    print(row[0], "/", end)
data = data.loc[keep, :]

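Vectorization is new to me. I have tried a few operations, but because I am working with a Series rather than a single value, comparing or setting values seems to be a problem. From what I have read, it looks like I need to write a function so that I can do: data['keep_it'] = my_fun(data['datetime'])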

Failed attempts:

data['keep_it'] = my_fun(data['datetime'])
def my_fun(row): # returns 1 if desired to keep  # a vectorized approach
    if (row < datetime.date(year = row.year, month = row.month, day = row.day, hour = 9, minute = 30, second = 0)):
        return 1
     # AttributeError: 'Series' object has no attribute 'year'
    if (row < pd.to_datetime(str(row['datetime'].year) +'/' + str(row['datetime'].month) +'/' + str(row['datetime'].day) + 'T9:30:00')):
        return 1
    # AttributeError: 'Series' object has no attribute 'year'

Any ideas? Thanks!

2 answers:

Answer 0 (score: 1)

This is vectorized.

import datetime as dt
import io

import pandas as pd

df = pd.read_csv(io.StringIO("""    Date   Time      Open     Close            datetime
0        02/10/2012  07:26  191.9500  191.9500  2012-02-10 07:26:00
1        02/10/2012  07:56  191.6600  191.6600  2012-02-10 07:56:00
2        02/10/2012  08:00  191.9400  191.9400  2012-02-10 08:00:00
3        02/10/2012  09:30  191.7500  191.7500  2012-02-10 09:30:00
4        02/10/2012  09:54  191.8500  191.8500  2012-02-10 09:54:00"""), sep=r"\s\s+", engine="python")

df["datetime"] = pd.to_datetime(df["datetime"])
df.loc[df["datetime"].dt.time.between(dt.time(9,30),dt.time(15,59))]
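
To illustrate why this works, here is a minimal sketch: `.dt.time` gives a Series of `datetime.time` values, and `Series.between` includes both endpoints by default, so rows stamped exactly 09:30 or 15:59 are kept. The `sample` frame and its values are made up for the example and are not the asker's data. `DataFrame.between_time` is shown as an alternative; it requires a DatetimeIndex, so the column is set as the index first.

```python
import datetime as dt

import pandas as pd

# Hypothetical sample data, chosen to hit both boundaries.
sample = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2012-02-10 07:26:00",
        "2012-02-10 09:30:00",
        "2012-02-10 15:59:00",
        "2012-02-10 16:00:00",
    ]),
    "Close": [191.95, 191.75, 192.10, 192.20],
})

# Boolean mask computed on the whole column at once; both bounds are inclusive.
mask = sample["datetime"].dt.time.between(dt.time(9, 30), dt.time(15, 59))
kept = sample.loc[mask]

# Alternative: between_time operates on a DatetimeIndex, so set it temporarily.
kept_alt = (
    sample.set_index("datetime")
          .between_time("09:30", "15:59")
          .reset_index()
)

print(kept)
```

Both `kept` and `kept_alt` should contain the same rows; the second form resets the index, so the original row labels are not preserved.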

Answer 1 (score: 1)

Thanks to @MrFuppes, I came up with this rough approach, which is essentially instantaneous:

testing = pd.DatetimeIndex(data['datetime'])
data = data[(testing.hour<16) & (testing.hour*60+testing.minute >= 9*60+30)] 

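Room for improvement includes getting rid of the intermediate `testing` variable by doing this in a single line, and possibly making proper use of the DatetimeIndex `.time` attribute.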
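
One way that improvement might look, as a sketch only: the `.dt` accessor exposes `hour` and `minute` directly on the column, so the intermediate index is not needed. The `data` frame below is a made-up stand-in for the real one, and the `minutes` name is introduced here purely for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the asker's `data` frame.
data = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2012-02-10 07:26:00",
        "2012-02-10 09:30:00",
        "2012-02-10 15:59:00",
        "2012-02-10 16:00:00",
    ]),
    "Open": [191.95, 191.75, 192.10, 192.20],
})

# Minutes since midnight, computed for every row in one vectorized step.
minutes = data["datetime"].dt.hour * 60 + data["datetime"].dt.minute

# Keep 09:30 through 15:59, both ends inclusive.
data = data[minutes.between(9 * 60 + 30, 15 * 60 + 59)]
print(data)
```

Comparing integer minutes keeps the work in numeric columns, whereas `.dt.time` produces a column of Python `time` objects; which one is faster on 2 million rows would need to be measured.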