Summing data based on a condition

Time: 2019-07-23 20:59:40

Tags: python pandas

I have a dataset that looks like this:

import pandas as pd

df = pd.DataFrame({
 'car_id': ['1', '1', '1', '1', '1', '1', '1', '1', '1','1','1','1'],
 'odometer_start': [0, 3, 6, 9, 13, 18, 39, 89, 101, 107, 122, 182],
 'odometer_end': [3, 6, 9, 13, 18, 39, 89, 101, 107, 122, 182, 206],
 'need_maintanince': [0,0,1,0,0,0,1,0,1,0,1,0]
 })

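Essentially, I want to sum the odometer differences of each observation since the last time a need for maintenance was flagged.

So I would like the result to come out like this: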

pd.DataFrame({
 'car_id': ['1', '1', '1', '1', '1', '1', '1', '1', '1','1','1','1'],
 'odometer_start': [0, 3, 6, 9, 13, 18, 39, 89, 101, 107, 122, 182],
 'odometer_end': [3, 6, 9, 13, 18, 39, 89, 101, 107, 122, 182, 206],
 'need_maintanince': [0,0,1,0,0,0,1,0,1,0,1,0],
 'miles_since_maint': [0,0,0,4,9,30,80,12,18,15,75,24]})

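Basically, for each observation it should accumulate the distance driven since the last observation of the same car_id that was flagged as needing maintenance. After each flagged observation the running total starts over and keeps accumulating the miles since that maintenance.

For reference, I am trying to predict the number of miles driven before a car needs repair.

Does anyone know how to do this?

Edit:

I don't think I stated the expected output as clearly as I intended. I have updated it to match what I need and simplified the data frame so it is easier to interpret, since having multiple car IDs was confusing even me.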

3 answers:

Answer 0: (score: 2)

IIUC:

import numpy as np

# Per-car difference between consecutive odometer readings
s = df.groupby('car_id')['odometer_start'].diff()
df['miles_since_last_maint'] = np.where(df['need_maintanince'], s, 0)

which gives:

   car_id  odometer_start  need_maintanince  miles_since_last_maint
0       1               0                 0                     0.0
1       2               5                 0                     0.0
2       2               9                 0                     0.0
3       3               1                 0                     0.0
4       3               3                 1                     2.0
5       3               8                 0                     0.0
6       3              19                 1                    11.0
7       3              52                 1                    33.0
8       1              11                 0                     0.0
9       2              22                 0                     0.0
10      2              64                 1                    42.0
11      4             132                 0                     0.0
12      4             144                 1                    12.0
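
The snippet above does not show its input. As a self-contained sketch, assuming the input is the multi-car sample frame that appears later in this thread (apparently the question's data before it was edited), the same two lines can be run like this:

```python
import numpy as np
import pandas as pd

# Assumed input: the multi-car sample data used elsewhere in this thread.
df = pd.DataFrame({
    'car_id': ['1', '2', '2', '3', '3', '3', '3', '3', '1', '2', '2', '4', '4'],
    'odometer_start': [0, 5, 9, 1, 3, 8, 19, 52, 11, 22, 64, 132, 144],
    'need_maintanince': [0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1],
})

# Difference to the previous reading of the same car (NaN on each car's first row).
s = df.groupby('car_id')['odometer_start'].diff()

# Keep the difference only on rows flagged for maintenance, 0 everywhere else.
df['miles_since_last_maint'] = np.where(df['need_maintanince'] == 1, s, 0)
print(df)
```

Note that this reports the distance since the previous observation of the same car, not a running total since the last maintenance flag.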

Answer 1: (score: 1)

This seems to give you the result you want:

df = pd.DataFrame({
 'car_id': ['1', '2', '2', '3', '3', '3', '3', '3', '1','2','2','4','4'],
 'odometer_start': [0, 5, 9, 1,3, 8,19,52,11,22,64,132, 144],
 'need_maintanince': [0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1]
 })

df['miles_since_maint'] = (df.groupby('car_id')['odometer_start'].diff()
                            * df['need_maintanince']).fillna(0)

Result:

   car_id        ...          miles_since_maint
0       1        ...                        0.0
1       2        ...                        0.0
2       2        ...                        0.0
3       3        ...                        0.0
4       3        ...                        2.0
5       3        ...                        0.0
6       3        ...                       11.0
7       3        ...                       33.0
8       1        ...                        0.0
9       2        ...                        0.0
10      2        ...                       42.0
11      4        ...                        0.0
12      4        ...                       12.0

Edit, per the comments:

df = pd.DataFrame({
 'car_id': ['1', '1', '1', '1', '1', '1', '1', '1', '1','1','1','1'],
 'odometer_start': [0, 3, 6, 9, 13, 18, 39, 89, 101, 107, 122, 182],
 'odometer_end': [3, 6, 9, 13, 18, 39, 89, 101, 107, 122, 182, 206],
 'need_maintanince': [0,0,1,0,0,0,1,0,1,0,1,0],
 'miles_since_maint': [0,0,0,4,9,30,80,12,18,15,75,24]})

df['odo_chg'] = df['odometer_end'] - df['odometer_start']
maint_group = df['need_maintanince'].shift().cumsum().fillna(0)
df['miles_since_maint_2'] = (df.groupby(['car_id', maint_group])['odo_chg'].cumsum())
# Reassign initial group to 0 per desired output
df.loc[maint_group == 0, 'miles_since_maint_2'] = 0
df.T

which gives (transposed for easier viewing):

                    0  1  2   3   4   5   6    7    8    9    10   11
car_id               1  1  1   1   1   1   1    1    1    1    1    1
odometer_start       0  3  6   9  13  18  39   89  101  107  122  182
odometer_end         3  6  9  13  18  39  89  101  107  122  182  206
need_maintanince     0  0  1   0   0   0   1    0    1    0    1    0
miles_since_maint    0  0  0   4   9  30  80   12   18   15   75   24
odo_chg              3  3  3   4   5  21  50   12    6   15   60   24
miles_since_maint_2  0  0  0   4   9  30  80   12   18   15   75   24
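
One thing to watch if the real data has several cars: the `shift().cumsum()` above numbers the maintenance segments across the whole frame, so `maint_group == 0` would only zero out the leading rows of the first car. A minimal sketch of a per-car variant follows; the function name `miles_since_maintenance` is made up for illustration, and it assumes the rows of each car are already in chronological order.

```python
import pandas as pd

def miles_since_maintenance(df):
    """Running miles since the previous maintenance flag, computed per car."""
    odo_chg = df['odometer_end'] - df['odometer_start']
    # Shift the flag within each car: a flagged row closes its own segment and
    # the following row opens a new one. A car's first row has no predecessor.
    prev_flag = df.groupby('car_id')['need_maintanince'].shift().fillna(0)
    # Segment number per car: 0 until after the car's first flag, then 1, 2, ...
    segment = prev_flag.groupby(df['car_id']).cumsum()
    miles = odo_chg.groupby([df['car_id'], segment]).cumsum()
    # Rows up to and including a car's first flag are set to 0, per the desired output.
    return miles.where(segment > 0, 0)

# The single-car data from the question.
df = pd.DataFrame({
    'car_id': ['1'] * 12,
    'odometer_start': [0, 3, 6, 9, 13, 18, 39, 89, 101, 107, 122, 182],
    'odometer_end': [3, 6, 9, 13, 18, 39, 89, 101, 107, 122, 182, 206],
    'need_maintanince': [0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0],
})
df['miles_since_maint_2'] = miles_since_maintenance(df)
print(df.T)
```

For a single car this should reproduce the `miles_since_maint` row shown above; with several cars, each one gets its own zeroed lead-in and its own running totals.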

Answer 2: (score: 1)

Similar to Quang Hoang's answer, but as a one-liner without numpy:

df['miles_since_last_maint'] = df.groupby('car_id')['odometer_start'].diff().where(df.need_maintanince==1,0).astype(int)

Result:

   car_id  need_maintanince  odometer_start  miles_since_last_maint
0       1                 0               0                       0
1       2                 0               5                       0
2       2                 0               9                       0
3       3                 0               1                       0
4       3                 1               3                       2
5       3                 0               8                       0
6       3                 1              19                      11
7       3                 1              52                      33
8       1                 0              11                       0
9       2                 0              22                       0
10      2                 1              64                      42
11      4                 0             132                       0
12      4                 1             144                      12
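
A caveat with the one-liner: `diff()` leaves NaN on the first row of every car, and `where` only replaces it with 0 when that row is not flagged. If a car's very first observation is flagged, the NaN survives and `astype(int)` fails. A hedged sketch of a more defensive variant, using made-up data in which car '2' is flagged on its first row:

```python
import pandas as pd

# Hypothetical data: car '2' needs maintenance on its very first observation.
df = pd.DataFrame({
    'car_id': ['1', '1', '2', '2'],
    'odometer_start': [0, 7, 10, 25],
    'need_maintanince': [0, 1, 1, 1],
})

diff = df.groupby('car_id')['odometer_start'].diff()
df['miles_since_last_maint'] = (
    diff.where(df['need_maintanince'] == 1, 0)  # 0 on unflagged rows
        .fillna(0)                              # first-row NaN that survived the mask
        .astype(int)
)
print(df)
```

Whether 0 is the right value for a flagged first row depends on the data; it is used here only so that the integer cast succeeds.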