Quickly incrementing dates by a number of days taken from another column in pandas

Time: 2019-01-09 13:43:58

Tags: pandas datetime

I have the following pandas DataFrame:

print(df.to_dict())
{'Date_Installed': {11885: Timestamp('2018-11-15 00:00:00'), 111885: Timestamp('2018-11-15 00:00:00')}, 'days_from_instalation': {11885: 2, 111885: 3}}

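I want to create a new column that increments the 'Date_Installed' column by the number of days given in the 'days_from_instalation' column.

I know this can be done with the apply() method, as follows: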

from datetime import timedelta
df['desired_date']=df.apply(lambda row:row['Date_Installed']+timedelta(row['days_from_instalation']), axis=1)

This produces the output I want:

print(df.to_dict())

{'Date_Installed': {11885: Timestamp('2018-11-15 00:00:00'), 111885: Timestamp('2018-11-15 00:00:00')}, 'days_from_instalation': {11885: 2, 111885: 3}, 'desired_date': {11885: Timestamp('2018-11-17 00:00:00'), 111885: Timestamp('2018-11-18 00:00:00')}}

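But this approach is very slow and is not practical for my full DataFrame.

I have looked at several questions about incrementing dates in pandas, such as this one:

pandas-increment-datetime

But they all seem to deal with a constant increment, and none of them offers a vectorized method for a per-row increment.

Is there a vectorized version of this kind of increment?

Thanks!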

1 Answer:

Answer 0: (score: 3)

Add the timedeltas created by to_timedelta:

import pandas as pd
# Parentheses let the expression continue onto the next line
df['desired_date'] = (df['Date_Installed'] +
                      pd.to_timedelta(df['days_from_instalation'], unit='d'))

print (df)
       Date_Installed  days_from_instalation desired_date
11885      2018-11-15                      2   2018-11-17
111885     2018-11-15                      3   2018-11-18
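For anyone who wants to check the result against the apply() version from the question, here is a minimal self-contained sketch. The sample frame is rebuilt by hand from the to_dict() output above, and the names via_apply and via_timedelta are only illustrative. It uses unit="D", which is an accepted spelling of the day unit.

```python
from datetime import timedelta

import pandas as pd

# Sample data rebuilt from the question's to_dict() output
df = pd.DataFrame(
    {
        "Date_Installed": pd.to_datetime(["2018-11-15", "2018-11-15"]),
        "days_from_instalation": [2, 3],
    },
    index=[11885, 111885],
)

# Row-wise baseline, as in the question (one Python call per row)
via_apply = df.apply(
    lambda row: row["Date_Installed"]
    + timedelta(days=int(row["days_from_instalation"])),
    axis=1,
)

# Vectorized version: convert the whole column to timedeltas, then add once
via_timedelta = df["Date_Installed"] + pd.to_timedelta(
    df["days_from_instalation"], unit="D"
)

# Both approaches give 2018-11-17 and 2018-11-18
print((via_apply == via_timedelta).all())
print(via_timedelta)
```

The addition is element-wise and aligned on the index, so the day counts do not need to be constant.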

Another solution, using NumPy, is faster, but loses the time zone (if one is specified):

import numpy as np
# Work on raw int64 nanoseconds: day offsets first, then add them to the dates
a = pd.to_timedelta(df['days_from_instalation'], unit='d').values.astype(np.int64)
df['desired_date1'] = pd.to_datetime(df['Date_Installed'].values.astype(np.int64)+a, unit='ns')
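To make the time zone caveat concrete, here is a small sketch. It assumes a tz-aware 'Date_Installed' column in "Europe/Berlin", which is not part of the question and is chosen only for illustration. It also casts explicitly to nanosecond resolution with to_numpy() before converting to int64, a slight variation on the code above so that the unit='ns' step does not depend on how the column is stored.

```python
import numpy as np
import pandas as pd

# Assumption for illustration: the install dates carry a time zone
df = pd.DataFrame(
    {
        "Date_Installed": pd.to_datetime(
            ["2018-11-15", "2018-11-15"]
        ).tz_localize("Europe/Berlin"),
        "days_from_instalation": [2, 3],
    }
)

# pandas arithmetic keeps the time zone
kept = df["Date_Installed"] + pd.to_timedelta(
    df["days_from_instalation"], unit="D"
)

# NumPy path: both sides become int64 nanoseconds since the epoch (UTC)
dates_ns = df["Date_Installed"].to_numpy(dtype="datetime64[ns]").astype(np.int64)
delta_ns = (
    pd.to_timedelta(df["days_from_instalation"], unit="D")
    .to_numpy(dtype="timedelta64[ns]")
    .astype(np.int64)
)
dropped = pd.Series(pd.to_datetime(dates_ns + delta_ns, unit="ns"), index=df.index)

print(kept.dt.tz)     # the original zone is still attached
print(dropped.dt.tz)  # None: the result is naive, holding UTC wall times

# One way to get the zone back afterwards
restored = dropped.dt.tz_localize("UTC").dt.tz_convert("Europe/Berlin")
print((restored == kept).all())
```

Note that the naive result holds UTC wall-clock values, not local ones, so re-localizing directly to the original zone without going through UTC would shift the timestamps.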

Performance

# 20k rows
df = pd.concat([df] * 10000, ignore_index=True)

In [217]: %timeit df['desired_date1'] = pd.to_datetime(df['Date_Installed'].values.astype(np.int64) + pd.to_timedelta(df['days_from_instalation'], unit='d').values.astype(np.int64), unit='ns')
886 µs ± 9.92 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [218]: %timeit df['desired_date'] = df['Date_Installed'] + pd.to_timedelta(df['days_from_instalation'], unit='d')
1.53 ms ± 82.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
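The timings above come from IPython's %timeit. Outside IPython, a rough equivalent can be sketched with the standard timeit module. The helper names with_pandas and with_numpy are made up for this example, and the numbers will differ by machine and pandas version, so treat it as a way to reproduce the comparison rather than as a reference result.

```python
import timeit

import numpy as np
import pandas as pd

base = pd.DataFrame(
    {
        "Date_Installed": pd.to_datetime(["2018-11-15", "2018-11-15"]),
        "days_from_instalation": [2, 3],
    }
)
# 20k rows, built the same way as in the answer
big = pd.concat([base] * 10000, ignore_index=True)


def with_pandas(frame):
    """Add per-row day offsets using pandas timedelta arithmetic."""
    return frame["Date_Installed"] + pd.to_timedelta(
        frame["days_from_instalation"], unit="D"
    )


def with_numpy(frame):
    """Add per-row day offsets on raw int64 nanoseconds (drops any time zone)."""
    dates_ns = frame["Date_Installed"].to_numpy(dtype="datetime64[ns]").astype(np.int64)
    delta_ns = (
        pd.to_timedelta(frame["days_from_instalation"], unit="D")
        .to_numpy(dtype="timedelta64[ns]")
        .astype(np.int64)
    )
    return pd.Series(pd.to_datetime(dates_ns + delta_ns, unit="ns"), index=frame.index)


for func in (with_pandas, with_numpy):
    # Best of 5 repeats, 100 calls each, reported per call
    best = min(timeit.repeat(lambda: func(big), number=100, repeat=5)) / 100
    print(f"{func.__name__}: {best * 1e6:.0f} µs per call")
```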