I have a dataframe as shown below.
ID Status Date Cost
0 1 F 2017-06-22 500
1 1 M 2017-07-22 100
2 1 P 2017-10-22 100
3 1 F 2018-06-22 600
4 1 M 2018-08-22 150
5 1 P 2018-10-22 120
6 1 F 2019-03-22 750
7 2 M 2017-06-29 200
8 2 F 2017-09-29 600
9 2 F 2018-01-29 500
10 2 M 2018-03-29 100
11 2 P 2018-08-29 100
12 2 M 2018-10-29 100
13 2 F 2018-12-29 500
14 3 M 2017-03-20 300
15 3 F 2018-06-20 700
16 3 P 2018-08-20 100
17 3 M 2018-10-20 250
18 3 F 2018-11-20 100
19 3 P 2018-12-20 100
20 3 F 2019-03-20 600
21 3 M 2019-05-20 200
22 4 M 2017-08-10 800
23 4 F 2018-06-10 100
24 4 P 2018-08-10 120
25 4 F 2018-10-10 500
26 4 M 2019-01-10 200
27 4 F 2019-06-10 600
where
F = Failure,
M = Maintenance,
P = Planned
Step 1 - For each ID, if the Status of the last row is not F, ignore the trailing rows after the last F, which gives the dataframe shown below.
ID Status Date Cost
0 1 F 2017-06-22 500
1 1 M 2017-07-22 100
2 1 P 2017-10-22 100
3 1 F 2018-06-22 600
4 1 M 2018-08-22 150
5 1 P 2018-10-22 120
6 1 F 2019-03-22 750
7 2 M 2017-06-29 200
8 2 F 2017-09-29 600
9 2 F 2018-01-29 500
10 2 M 2018-03-29 100
11 2 P 2018-08-29 100
12 2 M 2018-10-29 100
13 2 F 2018-12-29 500
14 3 M 2017-03-20 300
15 3 F 2018-06-20 700
16 3 P 2018-08-20 100
17 3 M 2018-10-20 250
18 3 F 2018-11-20 100
19 3 P 2018-12-20 100
20 3 F 2019-03-20 600
22 4 M 2017-08-10 800
23 4 F 2018-06-10 100
24 4 P 2018-08-10 120
25 4 F 2018-10-10 500
26 4 M 2019-01-10 200
27 4 F 2019-06-10 600
Now, for each ID, the last status is Failure.
Then, from the above df, I want to prepare the dataframe below.
ID No_of_F No_of_M No_of_P SLS NoDays_to_SLS NoDays_SLS_to_LS
1 3 2 2 P 487 151
2 3 3 2 M 487 61
3 3 2 2 P 640 90
4 3 1 1 M 518 151
SLS = second-to-last status
LS = last status
I tried the following code to calculate the durations.
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(['ID', 'Date', 'Status'])
df['D'] = df.groupby('ID')['Date'].diff().dt.days
ID Status Date Cost D
0 1 F 2017-06-22 500 nan
1 1 M 2017-07-22 100 30.00
2 1 P 2017-10-22 100 92.00
3 1 F 2018-06-22 600 243.00
4 1 M 2018-08-22 150 61.00
5 1 P 2018-10-22 120 61.00
6 1 F 2019-03-22 750 151.00
7 2 M 2017-06-29 200 nan
8 2 F 2017-09-29 600 92.00
9 2 F 2018-01-29 500 122.00
10 2 M 2018-03-29 100 59.00
11 2 P 2018-08-29 100 153.00
12 2 M 2018-10-29 100 61.00
13 2 F 2018-12-29 500 61.00
14 3 M 2017-03-20 300 nan
15 3 F 2018-06-20 700 457.00
16 3 P 2018-08-20 100 61.00
17 3 M 2018-10-20 250 61.00
18 3 F 2018-11-20 100 31.00
19 3 P 2018-12-20 100 30.00
20 3 F 2019-03-20 600 90.00
21 3 M 2019-05-20 200 61.00
22 4 M 2017-08-10 800 nan
23 4 F 2018-06-10 100 304.00
24 4 P 2018-08-10 120 61.00
25 4 F 2018-10-10 500 61.00
26 4 M 2019-01-10 200 92.00
27 4 F 2019-06-10 600 151.00
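The per-ID gap column D above comes from a grouped diff; a self-contained sketch of the same idea on a small made-up frame (the dates are a subset of the ones above, not the full table):

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 1, 2, 2],
    'Date': pd.to_datetime(['2017-06-22', '2017-07-22', '2017-10-22',
                            '2017-06-29', '2017-09-29']),
})

# diff() restarts within each ID group, so the first row of each ID is NaN
df['D'] = df.groupby('ID')['Date'].diff().dt.days
print(df['D'].tolist())  # [nan, 30.0, 92.0, nan, 92.0]
```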
Answer 0 (score: 3)
You can use groupby.cumsum and transform('idxmax') to create a mask to filter the DataFrame:
s = df.Status.eq('F').groupby(df.ID).cumsum().groupby(df.ID).transform('idxmax')
mask = s.index <= s
df_filtered = df[mask]
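As a quick check of the mask logic on a toy frame (made-up data; the `s.index <= s` comparison assumes a default integer RangeIndex, as in the question):

```python
import pandas as pd

# Toy frame: ID 1 ends with F, ID 2 has trailing non-F rows after its last F.
df = pd.DataFrame({
    'ID':     [1, 1, 1, 2, 2, 2],
    'Status': ['M', 'P', 'F', 'F', 'M', 'P'],
})

# Cumulative F-count per ID; its idxmax is the row label of the last F,
# because the cumulative sum first reaches its maximum exactly at that row.
s = df.Status.eq('F').groupby(df.ID).cumsum().groupby(df.ID).transform('idxmax')
mask = s.index <= s  # keep everything up to and including each ID's last F
print(df[mask].index.tolist())  # [0, 1, 2, 3]
```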
Then you can use pandas.crosstab and groupby.apply together with iloc to create the other columns:
df_new = pd.crosstab(df_filtered.ID, df_filtered.Status).add_prefix('No_of_')
g = df_filtered.groupby('ID')
df_new['SLS'] = g['Status'].apply(lambda x: x.iloc[-2])
df_new['NoDays_to_SLS'] = (g['Date'].apply(lambda x: x.iloc[-2]) - g['Date'].first()).dt.days
df_new['NoDays_SLS_to_LS'] = (g['Date'].last() - g['Date'].apply(lambda x: x.iloc[-2])).dt.days
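A minimal illustration of the `crosstab` + `add_prefix` counting step, on toy data (the IDs and statuses here are made up, not the question's table):

```python
import pandas as pd

ids = pd.Series([1, 1, 1, 2, 2], name='ID')
status = pd.Series(['F', 'M', 'F', 'P', 'F'], name='Status')

# One row per ID, one column per status value, cells holding the counts;
# add_prefix turns columns F/M/P into No_of_F/No_of_M/No_of_P.
counts = pd.crosstab(ids, status).add_prefix('No_of_')
# ID 1: No_of_F=2, No_of_M=1, No_of_P=0;  ID 2: No_of_F=1, No_of_M=0, No_of_P=1
print(counts)
```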
[Out]
Status No_of_F No_of_M No_of_P SLS NoDays_to_SLS NoDays_SLS_to_LS
ID
1 3 2 2 P 487 151
2 3 3 1 M 487 61
3 3 2 2 P 640 90
4 3 2 1 M 518 151
Answer 1 (score: 1)
Another approach. It does not require the last row to be 'F', and therefore does not need .iloc[-2]:
# the following two lines are only required if the date column is a string
# and the dataframe is not already sorted
df['Date'] = pd.to_datetime(df['Date'])
df.sort_values(['ID', 'Date'], axis='index', ascending=True, inplace=True)
# add some auxiliary columns
df['aux_date'] = df['Date'].where(df['Status'] != 'F', pd.NaT)
# one indicator column per status, named to match the aggregation below
df = pd.concat([df, pd.get_dummies(df['Status']).add_prefix('No_of_')], axis='columns')
# calculate the relevant dates and the sums of the statuses
grouped = df.groupby('ID')
aggregated = grouped.agg({'Date': 'min', 'aux_date': 'max', 'No_of_F': 'sum', 'No_of_M': 'sum', 'No_of_P': 'sum'})
aggregated['max_date'] = grouped['Date'].max()
# calculate the date differences
aggregated.loc[~aggregated['aux_date'].isnull(), 'NoDays_to_SLS']= aggregated['aux_date'] - aggregated['Date']
aggregated.loc[~aggregated['aux_date'].isnull(), 'NoDays_SLS_to_LS']= aggregated['max_date'] - aggregated['aux_date']
# drop the columns that were just inserted for calculation
aggregated.drop(['Date', 'aux_date', 'max_date'], axis='columns', inplace=True)
The output is:
No_of_F No_of_M No_of_P NoDays_to_SLS NoDays_SLS_to_LS
ID
1 3 2 2 487 days 151 days
2 3 3 1 487 days 61 days
3 3 2 2 640 days 90 days
4 3 2 1 518 days 151 days
5 2 2 2 487 days 0 days
Note: I inserted ID 5 to show what I mean by not relying on the last record being F. ID 5 is just a copy of the ID 1 rows with the last F row removed. That is why NoDays_SLS_to_LS is zero for that row.
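The aux_date trick (keeping dates only on non-F rows so that a plain max finds the last non-F date, i.e. the SLS date) can be seen in isolation; this toy frame is an illustration, not the data above:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 1],
    'Status': ['F', 'M', 'F'],
    'Date': pd.to_datetime(['2017-06-22', '2017-07-22', '2018-06-22']),
})

# Keep the date only on non-F rows; F rows become NaT.
df['aux_date'] = df['Date'].where(df['Status'] != 'F', pd.NaT)

# max() skips NaT, so the per-ID max is the date of the last non-F row.
print(df.groupby('ID')['aux_date'].max())
```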
The NoDays... columns are of type timedelta64. If you want to store them as integers, you can do the following:
aggregated['NoDays_to_SLS']= aggregated['NoDays_to_SLS'].dt.days
aggregated['NoDays_SLS_to_LS']= aggregated['NoDays_SLS_to_LS'].dt.days
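A quick check of that timedelta-to-integer conversion on a made-up series:

```python
import pandas as pd

td = pd.Series(pd.to_timedelta(['487 days', '151 days']))
print(td.dt.days.tolist())  # [487, 151]
```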