基于日期时间的分组方式,基于“ ID”和日期的分组顺序,然后按顺序表示

时间:2019-07-11 09:01:44

标签: pandas datetime pandas-groupby

我是熊猫功能的新手。 我有一个DF,如下所示。这是手机的维修数据。

  ID    Status       Date     Cost
0   1     F       22-Jun-17   500
1   1     M       22-Jul-17   100
2   2     M       29-Jun-17   200
3   3     M       20-Mar-17   300
4   4     M       10-Aug-17   800
5   2     F       29-Sep-17   600
6   2     F       29-Jan-18   500
7   1     F       22-Jun-18   600
8   3     F       20-Jun-18   700
9   1     M       22-Aug-18   150
10  1     F       22-Mar-19   750
11  3     M       20-Oct-18   250
12  4     F       10-Jun-18   100

我试图从以前的状态中找出每个ID的持续时间。

找到该ID的每个状态序列的平均值。

我的预期输出如下所示。

    ID  S1    S1_Dur  S2    S2_dur  S3   S3_dur  S4   S4_dur    Avg_MF  Avg_FM
0   1   F-M    30     M-F   335.00  F-M  61.00   M-F  750.00    542.50  45.50
1   2   M-F    92     F-F   122.00  NaN  nan     NaN    nan     92.00   nan
2   3   M-F    457    F-M   122.00  NaN  nan     NaN    nan     457.00  122.00
3   4   M-F    304    NaN   nan     NaN  nan     NaN    nan     304.00  nan


S1 = first sequence
S1_Dur = S1 Duration
Avg_MF = Average M-F Duration
Avg_FMn = Average F-M Duration

我尝试了以下代码

df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(['ID', 'Date', 'Status'])
df = df.reset_index().sort_values(['ID', 'Date', 'Status']).set_index(['ID', 'Status'])
df['Difference'] = df.groupby('ID')['Date'].transform(pd.Series.diff)
df.reset_index(inplace=True)

然后我得到了一个DF,如下图所示

   ID   Status  index   Date    Cost    Difference
0   1   F        0  2017-06-22  500     NaT
1   1   M        1  2017-07-22  100     30 days
2   1   F        7  2018-06-22  600     335 days
3   1   M        9  2018-08-22  150     61 days
4   1   F       10  2019-03-22  750     212 days
5   2   M        2  2017-06-29  200     NaT
6   2   F        5  2017-09-29  600     92 days
7   2   F        6  2018-01-29  500     122 days
8   3   M        3  2017-03-20  300     NaT
9   3   F        8  2018-06-20  700     457 days
10  3   M       11  2018-10-20  250     122 days
11  4   M        4  2017-08-10  800     NaT
12  4   F       12  2018-06-10  100     304 days

那之后我被困住了。

1 个答案:

答案 0 :(得分:1)

想法是通过DataFrameGroupBy.diff创建新的差异列,并将Status的移动值结合DataFrameGroupBy.shift。删除S列中缺少值的行。然后用DataFrame.unstackGroupBy.cumcount整形作为计数器列,按DataFrame.pivot_table每对S创建均值,最后使用DataFrame.join

df['Date'] = pd.to_datetime(df['Date'], format='%d-%b-%y')
df = df.sort_values(['ID', 'Date', 'Status'])

df['D'] = df.groupby('ID')['Date'].diff().dt.days
df['S'] = df.groupby('ID')['Status'].shift() + '-'+ df['Status']
df = df.dropna(subset=['S'])
df['g'] = df.groupby('ID').cumcount().add(1).astype(str)

df1 = df.pivot_table(index='ID', columns='S', values='D', aggfunc='mean').add_prefix('Avg_')

df2 = df.set_index(['ID', 'g'])[['S','D']].unstack().sort_index(axis=1, level=1)
df2.columns = df2.columns.map('_'.join)

df3 = df2.join(df1).reset_index()
print (df3)

   ID    D_1  S_1    D_2  S_2   D_3  S_3    D_4  S_4  Avg_F-F  Avg_F-M  \
0   1   30.0  F-M  335.0  M-F  61.0  F-M  212.0  M-F      NaN     45.5   
1   2   92.0  M-F  122.0  F-F   NaN  NaN    NaN  NaN    122.0      NaN   
2   3  457.0  M-F  122.0  F-M   NaN  NaN    NaN  NaN      NaN    122.0   
3   4  304.0  M-F    NaN  NaN   NaN  NaN    NaN  NaN      NaN      NaN   

   Avg_M-F  
0    273.5  
1     92.0  
2    457.0  
3    304.0