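I am working on a classification problem in which I am trying to predict whether a car will be refuelled the following day.
The data consists of a date, an ID for each car, and a dummy variable indicating whether the car was refuelled on that date.
What I want to create is the "days_since_refuelled" column. It should be the number of days since the last occurrence of refuelled == 1, and it obviously has to be calculated separately for each car_id. If there is no previous instance of refuelled == 1, the value should be -1.
The desired output should look like this: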
date        car_id  refuelled  days_since_refuelled
01-01-2019  1       0          -1
01-01-2019  2       1          -1
01-01-2019  3       1          -1
06-01-2019  1       0          -1
06-01-2019  2       0          5
06-01-2019  3       0          5
09-01-2019  1       1          -1
09-01-2019  2       0          8
09-01-2019  3       0          8
14-01-2019  1       0          5
14-01-2019  2       1          13
14-01-2019  3       0          13
Answer 0 (score: 5)
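Use Series.where to keep the date only on rows where refuelled is 1 (all other rows become NaT), then apply Series.shift and ffill within each car_id group so that every row carries the date of the most recent earlier refuel. Subtract that from the date column with Series.sub, convert the resulting timedeltas to days with Series.dt.days, and finally replace the missing values with -1 using fillna: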
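import pandas as pd

# Convert the day-first date strings to datetimes
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
# Keep the date only on refuel rows; every other row becomes NaT
df['days_since_refuelled'] = df['date'].where(df['refuelled'].eq(1))
# Per car: shift to exclude the current row, then forward-fill the last refuel date.
# transform keeps the original index (apply can add car_id as an index level in recent pandas).
last_refuel = (df.groupby('car_id')['days_since_refuelled']
                 .transform(lambda x: x.shift().ffill()))
# Days since the previous refuel, or -1 where there is none
df['days_since_refuelled'] = (df['date'].sub(last_refuel)
                                        .dt.days
                                        .fillna(-1)
                                        .astype(int))
print(df)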
          date  car_id  refuelled  days_since_refulled  days_since_refuelled
0   2019-01-01       1          0                   -1                    -1
1   2019-01-01       2          1                   -1                    -1
2   2019-01-01       3          1                   -1                    -1
3   2019-01-06       1          0                   -1                    -1
4   2019-01-06       2          0                    5                     5
5   2019-01-06       3          0                    5                     5
6   2019-01-09       1          1                   -1                    -1
7   2019-01-09       2          0                    8                     8
8   2019-01-09       3          0                    8                     8
9   2019-01-14       1          0                    5                     5
10  2019-01-14       2          1                   13                    13
11  2019-01-14       3          0                   13                    13
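
For anyone who wants to try this without the original data, here is a self-contained sketch of the same approach that rebuilds the sample frame from the question. The helper name `days_since_refuelled` is hypothetical (it is not part of pandas), and the sketch assumes the rows are already sorted by date within each car_id, since shift and ffill work on row order rather than on the date values.

```python
import pandas as pd

# Sample data copied from the question (dates are day-first strings).
df = pd.DataFrame({
    "date": ["01-01-2019"] * 3 + ["06-01-2019"] * 3
            + ["09-01-2019"] * 3 + ["14-01-2019"] * 3,
    "car_id": [1, 2, 3] * 4,
    "refuelled": [0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0],
})
df["date"] = pd.to_datetime(df["date"], dayfirst=True)


def days_since_refuelled(frame: pd.DataFrame) -> pd.Series:
    """Days since the previous refuel per car, -1 if there is none.

    Hypothetical helper; assumes rows are sorted by date within each car_id.
    """
    # Date on refuel rows, NaT everywhere else.
    refuel_date = frame["date"].where(frame["refuelled"].eq(1))
    # Per car: shift so the current row is not counted, then carry the
    # most recent earlier refuel date forward.
    last_refuel = refuel_date.groupby(frame["car_id"]).transform(
        lambda s: s.shift().ffill()
    )
    # Timedelta -> whole days; rows with no earlier refuel become -1.
    return frame["date"].sub(last_refuel).dt.days.fillna(-1).astype(int)


df["days_since_refuelled"] = days_since_refuelled(df)
print(df)
```

If the real data is not sorted, calling `df.sort_values(["car_id", "date"])` before the helper should be enough, because the result is aligned back on the index.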