My DataFrame is shown below.
Note that the data only ever contains the F-P-F pattern.
    ID Status        Date  Duration
0    1      F  2018-06-22       nan
1    1      P  2018-08-22     61.00
2    1      F  2018-10-22     61.00
3    3      F  2018-11-20       nan
4    3      P  2018-12-20     30.00
5    3      F  2019-03-20     90.00
6    4      F  2018-06-10       nan
7    4      P  2018-08-10     61.00
8    4      F  2018-12-10    122.00
9    7      F  2018-04-10       nan
10   7      P  2018-08-10    122.00
11   7      F  2018-11-10     92.00
12   7      P  2019-08-10    273.00
13   7      F  2019-10-10     61.00
From the DataFrame above, I would like to produce the following DataFrame.
ID    F_P_Duration  F_F_Duration
1             61.0         122.0
3             30.0         120.0
4             61.0         183.0
7_1          122.0         214.0
7_2          273.0         334.0
Here F_P_Duration is the number of days from F to P, and
F_F_Duration is the number of days from F to F within an F-P-F pattern for that ID.
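As a quick cross-check of these definitions, the two durations can also be read straight off the Date column. This is only a sketch: it assumes each P row sits between two F rows of the same ID, uses just the rows for IDs 1 and 7, and the names `df_dates` and `check` are made up for illustration.

```python
import pandas as pd

# Rows for ID 1 and ID 7, copied from the table above (Duration left out).
rows = [
    (1, "F", "2018-06-22"), (1, "P", "2018-08-22"), (1, "F", "2018-10-22"),
    (7, "F", "2018-04-10"), (7, "P", "2018-08-10"), (7, "F", "2018-11-10"),
    (7, "P", "2019-08-10"), (7, "F", "2019-10-10"),
]
df_dates = pd.DataFrame(rows, columns=["ID", "Status", "Date"])
df_dates["Date"] = pd.to_datetime(df_dates["Date"])

# Date of the previous and next row within the same ID.
grouped = df_dates.groupby("ID")["Date"]
prev_f = grouped.shift(1)
next_f = grouped.shift(-1)

# Keep only the P rows: the row before and the row after are the two F's.
is_p = df_dates["Status"] == "P"
check = pd.DataFrame({
    "ID": df_dates.loc[is_p, "ID"],
    "F_P_Duration": (df_dates["Date"] - prev_f)[is_p].dt.days,
    "F_F_Duration": (next_f - prev_f)[is_p].dt.days,
})
print(check)
```

For these rows the day counts agree with the expected output: 61 and 122 for ID 1, then 122/214 and 273/334 for the two patterns of ID 7.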
Answer 0 (score: 1)
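Judging by the ID column, it looks like you either take the duration from a single row or sum it with the previous row. The real trick is in the arranging and labeling. I think the code below should be fairly self-explanatory.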
# imports
import numpy as np
import pandas as pd
# Setup the data and the DataFrame.
data = [[1, 'F', '2018-06-22', np.nan],
        [1, 'P', '2018-08-22', 61.00],
        [1, 'F', '2018-10-22', 61.00],
        [3, 'F', '2018-11-20', np.nan],
        [3, 'P', '2018-12-20', 30.00],
        [3, 'F', '2019-03-20', 90.00],
        [4, 'F', '2018-06-10', np.nan],
        [4, 'P', '2018-08-10', 61.00],
        [4, 'F', '2018-12-10', 122.00],
        [7, 'F', '2018-04-10', np.nan],
        [7, 'P', '2018-08-10', 122.00],
        [7, 'F', '2018-11-10', 92.00],
        [7, 'P', '2019-08-10', 273.00],
        [7, 'F', '2019-10-10', 61.00]]
df = pd.DataFrame(data=data, columns=['ID', 'Status', 'Date', 'Duration'])
# Add a helper column for summing F_F durations.
df['DurShiftSum'] = df['Duration'] + df['Duration'].shift(1)
# F_P duration just appears to be the duration at P.
df.loc[df['Status']=='P', 'F_P_Duration'] = df.loc[df['Status']=='P', 'Duration']
# F_F durations is the F duration plus the previous P duration.
f_mask = (df['Status']=='F')&(df['Duration'].notnull())
df.loc[f_mask, 'F_F_Duration'] = df.loc[f_mask, 'DurShiftSum']
# Compress the DataFrame and drop unneeded columns.
df['F_F_Duration'] = df['F_F_Duration'].bfill(limit=1)
df = df.dropna(subset=['F_P_Duration'])
df = df.drop(labels=['Date', 'Duration', 'DurShiftSum'], axis=1)
# An unfortunate for-loop through the unique IDs.
# If your dataset is very big this might not be ideal.
df['ID'] = df['ID'].astype(str)
for xid in df['ID'].unique():
    if len(df.loc[df['ID']==xid]) > 1:
        len_frame = len(df.loc[df['ID']==xid])
        new_ids = [xid+f'_{i}' for i in range(1, len_frame+1)]
        df.loc[df['ID']==xid, 'ID'] = new_ids
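With the sample data you provided, this matches your expected result. All I do is create a helper column that sums each pair of adjacent rows, transfer the appropriate values into the F_P and F_F columns, and then clean up and format.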
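If the loop over unique IDs turns out to be too slow on a large dataset, the labeling step can probably be done without a loop. The sketch below is an alternative, not part of the code above. It starts from a hypothetical frame `out` with one row per P and numeric IDs, and numbers repeated IDs with `groupby(...).cumcount()`.

```python
import pandas as pd

# Hypothetical stand-in for the compressed frame (one row per P row).
out = pd.DataFrame({
    "ID": [1, 3, 4, 7, 7],
    "F_P_Duration": [61.0, 30.0, 61.0, 122.0, 273.0],
    "F_F_Duration": [122.0, 120.0, 183.0, 214.0, 334.0],
})

# 1-based position of each row within its ID, and the size of each ID group.
seq = out.groupby("ID").cumcount() + 1
size = out.groupby("ID")["ID"].transform("size")

# Keep the plain ID where it is unique; otherwise append "_<n>".
ids = out["ID"].astype(str)
out["ID"] = ids.where(size == 1, ids + "_" + seq.astype(str))
print(out)
```

The IDs come out as 1, 3, 4, 7_1, 7_2, the same labels the loop produces for the sample data.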