I have a dataframe as shown below.
ID Status Date Cost
0 1 F 2017-06-22 500
1 1 M 2017-07-22 100
2 1 P 2017-10-22 100
3 1 F 2018-06-22 600
4 1 M 2018-08-22 150
5 1 P 2018-10-22 120
6 1 F 2019-03-22 750
7 2 M 2017-06-29 200
8 2 F 2017-09-29 600
9 2 F 2018-01-29 500
10 2 M 2018-03-29 100
11 2 P 2018-08-29 100
12 2 M 2018-10-29 100
13 2 F 2018-12-29 500
14 3 M 2017-03-20 300
15 3 F 2018-06-20 700
16 3 P 2018-08-20 100
17 3 M 2018-10-20 250
18 3 F 2018-11-20 100
19 3 P 2018-12-20 100
20 3 F 2019-03-20 600
21 3 M 2019-05-20 200
22 4 M 2017-08-10 800
23 4 F 2018-06-10 100
24 4 P 2018-08-10 120
25 4 F 2018-10-10 500
26 4 M 2019-01-10 200
27 4 F 2019-06-10 600
where
F = Failure,
M = Maintenance,
P = Planned
Step 1 - For each ID, if the Status of the last row is not F, ignore the trailing rows after the last F, which gives the dataframe shown below.
ID Status Date Cost
0 1 F 2017-06-22 500
1 1 M 2017-07-22 100
2 1 P 2017-10-22 100
3 1 F 2018-06-22 600
4 1 M 2018-08-22 150
5 1 P 2018-10-22 120
6 1 F 2019-03-22 750
7 2 M 2017-06-29 200
8 2 F 2017-09-29 600
9 2 F 2018-01-29 500
10 2 M 2018-03-29 100
11 2 P 2018-08-29 100
12 2 M 2018-10-29 100
13 2 F 2018-12-29 500
14 3 M 2017-03-20 300
15 3 F 2018-06-20 700
16 3 P 2018-08-20 100
17 3 M 2018-10-20 250
18 3 F 2018-11-20 100
19 3 P 2018-12-20 100
20 3 F 2019-03-20 600
22 4 M 2017-08-10 800
23 4 F 2018-06-10 100
24 4 P 2018-08-10 120
25 4 F 2018-10-10 500
26 4 M 2019-01-10 200
27 4 F 2019-06-10 600
Now, for each ID, the last status is Failure.
Then, from the above df, I want to prepare the dataframe below.
ID No_of_F No_of_M No_of_P SLS NoDays_to_SLS NoDays_SLS_to_LS
1 3 2 2 P 487 151
2 3 3 2 M 487 61
3 3 2 2 P 640 90
4 3 1 1 M 518 151
SLS = second-to-last status
LS = last status
I tried the following code to calculate the durations.
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(['ID', 'Date', 'Status'])
df['D'] = df.groupby('ID')['Date'].diff().dt.days
ID Status Date Cost D
0 1 F 2017-06-22 500 nan
1 1 M 2017-07-22 100 30.00
2 1 P 2017-10-22 100 92.00
3 1 F 2018-06-22 600 243.00
4 1 M 2018-08-22 150 61.00
5 1 P 2018-10-22 120 61.00
6 1 F 2019-03-22 750 151.00
7 2 M 2017-06-29 200 nan
8 2 F 2017-09-29 600 92.00
9 2 F 2018-01-29 500 122.00
10 2 M 2018-03-29 100 59.00
11 2 P 2018-08-29 100 153.00
12 2 M 2018-10-29 100 61.00
13 2 F 2018-12-29 500 61.00
14 3 M 2017-03-20 300 nan
15 3 F 2018-06-20 700 457.00
16 3 P 2018-08-20 100 61.00
17 3 M 2018-10-20 250 61.00
18 3 F 2018-11-20 100 31.00
19 3 P 2018-12-20 100 30.00
20 3 F 2019-03-20 600 90.00
21 3 M 2019-05-20 200 61.00
22 4 M 2017-08-10 800 nan
23 4 F 2018-06-10 100 304.00
24 4 P 2018-08-10 120 61.00
25 4 F 2018-10-10 500 61.00
26 4 M 2019-01-10 200 92.00
27 4 F 2019-06-10 600 151.00
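The per-ID gap column D above comes from a grouped diff; a self-contained sketch of the same idea on a small made-up frame (the dates are a subset of the ones above, not the full table):

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 1, 2, 2],
    'Date': pd.to_datetime(['2017-06-22', '2017-07-22', '2017-10-22',
                            '2017-06-29', '2017-09-29']),
})

# diff() restarts within each ID group, so the first row of each ID is NaN
df['D'] = df.groupby('ID')['Date'].diff().dt.days
print(df['D'].tolist())  # [nan, 30.0, 92.0, nan, 92.0]
```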
Answer 0 (score: 3)
You can use groupby.cumsum and transform('idxmax') to create a mask to filter the DataFrame:
s = df.Status.eq('F').groupby(df.ID).cumsum().groupby(df.ID).transform('idxmax')
mask = s.index <= s
df_filtered = df[mask]
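As a quick check of the mask logic on a toy frame (made-up data; the `s.index <= s` comparison assumes a default integer RangeIndex, as in the question):

```python
import pandas as pd

# Toy frame: ID 1 ends with F, ID 2 has trailing non-F rows after its last F.
df = pd.DataFrame({
    'ID':     [1, 1, 1, 2, 2, 2],
    'Status': ['M', 'P', 'F', 'F', 'M', 'P'],
})

# Cumulative F-count per ID; its idxmax is the row label of the last F,
# because the cumulative sum first reaches its maximum exactly at that row.
s = df.Status.eq('F').groupby(df.ID).cumsum().groupby(df.ID).transform('idxmax')
mask = s.index <= s  # keep everything up to and including each ID's last F
print(df[mask].index.tolist())  # [0, 1, 2, 3]
```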
Then you can use pandas.crosstab and groupby.apply together with iloc to create the other columns:
df_new = pd.crosstab(df_filtered.ID, df_filtered.Status).add_prefix('No_of_')
g = df_filtered.groupby('ID')
df_new['SLS'] = g['Status'].apply(lambda x: x.iloc[-2])
df_new['NoDays_to_SLS'] = (g['Date'].apply(lambda x: x.iloc[-2]) - g['Date'].first()).dt.days
df_new['NoDays_SLS_to_LS'] = (g['Date'].last() - g['Date'].apply(lambda x: x.iloc[-2])).dt.days
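A minimal illustration of the `crosstab` + `add_prefix` counting step, on toy data (the IDs and statuses here are made up, not the question's table):

```python
import pandas as pd

ids = pd.Series([1, 1, 1, 2, 2], name='ID')
status = pd.Series(['F', 'M', 'F', 'P', 'F'], name='Status')

# One row per ID, one column per status value, cells holding the counts;
# add_prefix turns columns F/M/P into No_of_F/No_of_M/No_of_P.
counts = pd.crosstab(ids, status).add_prefix('No_of_')
# ID 1: No_of_F=2, No_of_M=1, No_of_P=0;  ID 2: No_of_F=1, No_of_M=0, No_of_P=1
print(counts)
```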
[Out]
Status No_of_F No_of_M No_of_P SLS NoDays_to_SLS NoDays_SLS_to_LS
ID
1 3 2 2 P 487 151
2 3 3 1 M 487 61
3 3 2 2 P 640 90
4 3 2 1 M 518 151
Answer 1 (score: 1)
Another approach. It does not require the last row to be 'F', and therefore does not need .iloc[-2]:
# the following two lines are only required if the date column is a string
# and the dataframe is not already sorted
df['Date'] = pd.to_datetime(df['Date'])
df.sort_values(['ID', 'Date'], axis='index', ascending=True, inplace=True)
# add some auxiliary columns
df['aux_date'] = df['Date'].where(df['Status'] != 'F', pd.NaT)
# one indicator column per status, named to match the aggregation below
df = pd.concat([df, pd.get_dummies(df['Status']).add_prefix('No_of_')], axis='columns')
# calculate the relevant dates and the sums of the statuses
grouped = df.groupby('ID')
aggregated = grouped.agg({'Date': 'min', 'aux_date': 'max', 'No_of_F': 'sum', 'No_of_M': 'sum', 'No_of_P': 'sum'})
aggregated['max_date'] = grouped['Date'].max()
# calculate the date differences
aggregated.loc[~aggregated['aux_date'].isnull(), 'NoDays_to_SLS']= aggregated['aux_date'] - aggregated['Date']
aggregated.loc[~aggregated['aux_date'].isnull(), 'NoDays_SLS_to_LS']= aggregated['max_date'] - aggregated['aux_date']
# drop the columns that were just inserted for calculation
aggregated.drop(['Date', 'aux_date', 'max_date'], axis='columns', inplace=True)
The output is:
No_of_F No_of_M No_of_P NoDays_to_SLS NoDays_SLS_to_LS
ID
1 3 2 2 487 days 151 days
2 3 3 1 487 days 61 days
3 3 2 2 640 days 90 days
4 3 2 1 518 days 151 days
5 2 2 2 487 days 0 days
Note: I inserted ID 5 to show what I mean by not relying on the last record being F. ID 5 is just a copy of the ID 1 rows with the last F row removed. That is why NoDays_SLS_to_LS is zero for that row.
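The aux_date trick (keeping dates only on non-F rows so that a plain max finds the last non-F date, i.e. the SLS date) can be seen in isolation; this toy frame is an illustration, not the data above:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 1],
    'Status': ['F', 'M', 'F'],
    'Date': pd.to_datetime(['2017-06-22', '2017-07-22', '2018-06-22']),
})

# Keep the date only on non-F rows; F rows become NaT.
df['aux_date'] = df['Date'].where(df['Status'] != 'F', pd.NaT)

# max() skips NaT, so the per-ID max is the date of the last non-F row.
print(df.groupby('ID')['aux_date'].max())
```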
The NoDays... columns are of type timedelta64. If you want to store them as integers, you can do the following:
aggregated['NoDays_to_SLS']= aggregated['NoDays_to_SLS'].dt.days
aggregated['NoDays_SLS_to_LS']= aggregated['NoDays_SLS_to_LS'].dt.days
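A quick check of that timedelta-to-integer conversion on a made-up series:

```python
import pandas as pd

td = pd.Series(pd.to_timedelta(['487 days', '151 days']))
print(td.dt.days.tolist())  # [487, 151]
```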