I have a dataset that looks like this:
import pandas as pd

df = pd.DataFrame({
'car_id': ['1', '1', '1', '1', '1', '1', '1', '1', '1','1','1','1'],
'odometer_start': [0, 3, 6, 9, 13, 18, 39, 89, 101, 107, 122, 182],
'odometer_end': [3, 6, 9, 13, 18, 39, 89, 101, 107, 122, 182, 206],
'need_maintanince': [0,0,1,0,0,0,1,0,1,0,1,0]
})
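Essentially, I want to sum the odometer differences of each observation since the last time maintenance was flagged.
So I would like the result to look like this: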
pd.DataFrame({
'car_id': ['1', '1', '1', '1', '1', '1', '1', '1', '1','1','1','1'],
'odometer_start': [0, 3, 6, 9, 13, 18, 39, 89, 101, 107, 122, 182],
'odometer_end': [3, 6, 9, 13, 18, 39, 89, 101, 107, 122, 182, 206],
'need_maintanince': [0,0,1,0,0,0,1,0,1,0,1,0],
'miles_since_maint': [0,0,0,4,9,30,80,12,18,15,75,24]})
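Basically, for each observation it should accumulate the distance driven since the last observation for the same car_id that was flagged as needing maintenance, and then keep accumulating the miles since that maintenance.
For context, I am trying to predict the number of miles until a car needs repair.
Does anyone know how to do this?
Edit:
I don't think my expected output was as clear as I intended. I have updated it to match what I need and made the data frame easier to interpret, since the multiple car IDs were confusing even me.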
Answer 0 (score: 2)
IIUC:
import numpy as np
s = df.groupby('car_id')['odometer_start'].diff()  # per-car change between consecutive readings
df['miles_since_last_maint'] = np.where(df['need_maintanince'], s, 0)  # keep it only on flagged rows
Output:
car_id odometer_start need_maintanince miles_since_last_maint
0 1 0 0 0.0
1 2 5 0 0.0
2 2 9 0 0.0
3 3 1 0 0.0
4 3 3 1 2.0
5 3 8 0 0.0
6 3 19 1 11.0
7 3 52 1 33.0
8 1 11 0 0.0
9 2 22 0 0.0
10 2 64 1 42.0
11 4 132 0 0.0
12 4 144 1 12.0
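One caveat worth checking: diff() returns NaN on each car's first row, so a car whose very first row is flagged would get NaN instead of 0. A minimal sketch of one way to handle that, using a made-up two-row frame (the data is an assumption for illustration, not from the question):

```python
import numpy as np
import pandas as pd

# Made-up data: car '1' is flagged on its very first row.
df = pd.DataFrame({
    'car_id': ['1', '1'],
    'odometer_start': [0, 7],
    'need_maintanince': [1, 1],
})

# Per-car difference; the first row of each car has no predecessor, so it is NaN.
s = df.groupby('car_id')['odometer_start'].diff()

# fillna(0) treats "no previous reading" as zero miles before the flag check.
df['miles_since_last_maint'] = np.where(df['need_maintanince'], s.fillna(0), 0)
```

Whether 0 is the right value for a flagged first row depends on how the data is meant to be read; it is only a placeholder here.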
Answer 1 (score: 1)
This seems to give the result you want:
df = pd.DataFrame({
'car_id': ['1', '2', '2', '3', '3', '3', '3', '3', '1','2','2','4','4'],
'odometer_start': [0, 5, 9, 1,3, 8,19,52,11,22,64,132, 144],
'need_maintanince': [0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1]
})
df['miles_since_maint'] = (df.groupby('car_id')['odometer_start'].diff()
* df['need_maintanince']).fillna(0)
car_id ... miles_since_maint
0 1 ... 0.0
1 2 ... 0.0
2 2 ... 0.0
3 3 ... 0.0
4 3 ... 2.0
5 3 ... 0.0
6 3 ... 11.0
7 3 ... 33.0
8 1 ... 0.0
9 2 ... 0.0
10 2 ... 42.0
11 4 ... 0.0
12 4 ... 12.0
Edit, per the comments:
df = pd.DataFrame({
'car_id': ['1', '1', '1', '1', '1', '1', '1', '1', '1','1','1','1'],
'odometer_start': [0, 3, 6, 9, 13, 18, 39, 89, 101, 107, 122, 182],
'odometer_end': [3, 6, 9, 13, 18, 39, 89, 101, 107, 122, 182, 206],
'need_maintanince': [0,0,1,0,0,0,1,0,1,0,1,0],
'miles_since_maint': [0,0,0,4,9,30,80,12,18,15,75,24]})
df['odo_chg'] = df['odometer_end'] - df['odometer_start']
maint_group = df['need_maintanince'].shift().cumsum().fillna(0)
df['miles_since_maint_2'] = (df.groupby(['car_id', maint_group])['odo_chg'].cumsum())
# Reassign initial group to 0 per desired output
df.loc[maint_group == 0, 'miles_since_maint_2'] = 0
df.T
Output (transposed for easier viewing):
0 1 2 3 4 5 6 7 8 9 10 11
car_id 1 1 1 1 1 1 1 1 1 1 1 1
odometer_start 0 3 6 9 13 18 39 89 101 107 122 182
odometer_end 3 6 9 13 18 39 89 101 107 122 182 206
need_maintanince 0 0 1 0 0 0 1 0 1 0 1 0
miles_since_maint 0 0 0 4 9 30 80 12 18 15 75 24
odo_chg 3 3 3 4 5 21 50 12 6 15 60 24
miles_since_maint_2 0 0 0 4 9 30 80 12 18 15 75 24
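Note that the snippet above shifts the flag over the whole frame, which is fine for a single car. With several cars, one car's last flag could leak into the next car's first row. A minimal sketch of a per-car variant, assuming rows are already ordered by odometer within each car; the helper name miles_since_maint and the second car's data are made up for illustration:

```python
import pandas as pd


def miles_since_maint(df):
    """Cumulative distance since the last flagged row, computed per car.

    Hypothetical helper; assumes rows are ordered by odometer within each car.
    """
    odo_chg = df['odometer_end'] - df['odometer_start']
    # Shift the flag within each car, so a flagged row still counts toward
    # the stretch leading up to it and the following row starts a new stretch.
    prev_flag = df.groupby('car_id')['need_maintanince'].shift().fillna(0).astype(int)
    stretch = prev_flag.groupby(df['car_id']).cumsum()
    miles = odo_chg.groupby([df['car_id'], stretch]).cumsum()
    # Rows before a car's first maintenance flag are 0, per the desired output.
    return miles.where(stretch > 0, 0)


# Made-up two-car example.
df = pd.DataFrame({
    'car_id': ['1', '1', '1', '2', '2', '2'],
    'odometer_start': [0, 3, 6, 0, 10, 25],
    'odometer_end': [3, 6, 10, 10, 25, 45],
    'need_maintanince': [0, 1, 0, 1, 0, 0],
})
df['miles_since_maint'] = miles_since_maint(df)
```

On the single-car data from the question this reproduces the miles_since_maint column; the only behavioural difference is at the boundary between cars.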
Answer 2 (score: 1)
Similar to Quang Hoang's answer, but as a one-liner without NumPy:
df['miles_since_last_maint'] = df.groupby('car_id')['odometer_start'].diff().where(df.need_maintanince==1,0).astype(int)
Result:
car_id need_maintanince odometer_start miles_since_last_maint
0 1 0 0 0
1 2 0 5 0
2 2 0 9 0
3 3 0 1 0
4 3 1 3 2
5 3 0 8 0
6 3 1 19 11
7 3 1 52 33
8 1 0 11 0
9 2 0 22 0
10 2 1 64 42
11 4 0 132 0
12 4 1 144 12