I have a dataframe like the following:
ID time Item Status Combined
1 4/29/20 20:32 A OK A_OK
1 4/29/20 20:32 A OK A_OK
1 4/29/20 20:32 A OK A_OK
1 4/29/20 20:32 A OK A_OK
1 4/29/20 20:32 A FAIL A_FAIL
1 4/29/20 20:32 A FAIL A_FAIL
1 4/29/20 20:34 B OK B_OK
1 4/29/20 20:53 A OK A_OK
1 4/29/20 20:53 A OK A_OK
1 4/29/20 20:58 C OK C_OK
2 5/30/20 22:32 A OK A_OK
2 5/30/20 22:32 A OK A_OK
2 5/30/20 22:32 A OK A_OK
2 5/30/20 22:32 A FAIL A_FAIL
2 5/30/20 22:32 B OK B_OK
2 5/30/20 22:32 B OK B_OK
2 4/29/20 20:53 A OK A_OK
2 4/29/20 20:53 C FAIL C_FAIL
2 4/29/20 20:53 C FAIL C_FAIL
2 4/29/20 20:58 D OK D_OK
Every unique item is in the Combined column.
For each unique ID, I want to get, in order:
1 [[A_OK], [A_FAIL, B_OK], [A_OK], [C_OK]]
2 [[A_OK], [A_FAIL,B_OK], [A_OK, C_FAIL], [D_OK]]
If not the above, the following would also work as a txt file, where each line represents one ID, -1 marks the end of an item set, and -2 marks the end of that ID's line.
A_OK -1 A_FAIL B_OK -1 A_OK -1 C_OK -1 -2
A_OK -1 A_FAIL B_OK -1 A_OK C_FAIL -1 D_OK -1 -2
If items in "Combined" fall within a 2-minute time window, they belong to the same item set (the same sublist); otherwise they count as another item set for the same ID.
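For reference, the sentinel format above can be produced from nested per-ID lists with a few lines of Python. This is only an illustrative sketch; the helper name `to_sentinel_line` is mine, not from the question:

```python
def to_sentinel_line(item_sets):
    """Join one ID's item sets into a single line, closing each
    item set with -1 and the whole line with -2."""
    parts = []
    for item_set in item_sets:
        parts.extend(item_set)
        parts.append('-1')
    parts.append('-2')
    return ' '.join(parts)

line = to_sentinel_line([['A_OK'], ['A_FAIL', 'B_OK'], ['A_OK'], ['C_OK']])
print(line)  # A_OK -1 A_FAIL B_OK -1 A_OK -1 C_OK -1 -2
```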
Answer 0 (score: 0)
Not sure I fully understand, but you can try apply(list). For example:
import pandas as pd
df = pd.DataFrame({'ID': ['1', '1', '1', '2', '2', '2'],
                   'Combined': ['A_OK', 'A_FAIL', 'B_OK', 'A_FAIL', 'B_FAIL', 'C_OK']})
df
Out[3]:
ID Combined
0 1 A_OK
1 1 A_FAIL
2 1 B_OK
3 2 A_FAIL
4 2 B_FAIL
5 2 C_OK
df.groupby('ID')['Combined'].apply(list)
Out[4]:
ID
1 [A_OK, A_FAIL, B_OK]
2 [A_FAIL, B_FAIL, C_OK]
Name: Combined, dtype: object
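Note that apply(list) keeps every duplicate row. If repeats should be dropped while preserving first-seen order (as the question's expected output suggests), pd.unique can be applied per group. A small sketch, with made-up data for illustration:

```python
import pandas as pd

df = pd.DataFrame({'ID': ['1', '1', '1', '1'],
                   'Combined': ['A_OK', 'A_OK', 'A_FAIL', 'A_OK']})

# pd.unique preserves order of first appearance, unlike set()
deduped = df.groupby('ID')['Combined'].apply(lambda s: list(pd.unique(s)))
print(deduped.loc['1'])  # ['A_OK', 'A_FAIL']
```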
Answer 1 (score: 0)
OK, so I understand you want this, rather than what you just wrote:
1 [[A_OK, A_FAIL, B_OK], [A_OK], [C_OK]]
2 [[A_OK, A_FAIL, B_OK], [A_OK, C_FAIL], [D_OK]]
To do that:
import pandas as pd
from datetime import datetime
"""Create the df"""
fmt = '%d-%m-%Y %H:%M'
d1 = datetime.strptime('17-07-2020 20:32', fmt)
d2 = datetime.strptime('17-07-2020 20:34', fmt)
d3 = datetime.strptime('17-07-2020 20:53', fmt)
d4 = datetime.strptime('17-07-2020 20:58', fmt)
df = pd.DataFrame({
    'ID': ['1'] * 10 + ['2'] * 10,
    'Date': [d1, d1, d1, d1, d1, d1, d2, d3, d3, d4] * 2,
    'Combined': ['A_OK', 'A_OK', 'A_OK', 'A_OK', 'A_FAIL', 'A_FAIL',
                 'B_OK', 'A_OK', 'A_OK', 'C_OK'] * 2,
})
print(df)
ID Date Combined
0 1 2020-07-17 20:32:00 A_OK
1 1 2020-07-17 20:32:00 A_OK
2 1 2020-07-17 20:32:00 A_OK
3 1 2020-07-17 20:32:00 A_OK
4 1 2020-07-17 20:32:00 A_FAIL
5 1 2020-07-17 20:32:00 A_FAIL
6 1 2020-07-17 20:34:00 B_OK
7 1 2020-07-17 20:53:00 A_OK
8 1 2020-07-17 20:53:00 A_OK
9 1 2020-07-17 20:58:00 C_OK
10 2 2020-07-17 20:32:00 A_OK
11 2 2020-07-17 20:32:00 A_OK
12 2 2020-07-17 20:32:00 A_OK
13 2 2020-07-17 20:32:00 A_OK
14 2 2020-07-17 20:32:00 A_FAIL
15 2 2020-07-17 20:32:00 A_FAIL
16 2 2020-07-17 20:34:00 B_OK
17 2 2020-07-17 20:53:00 A_OK
18 2 2020-07-17 20:53:00 A_OK
19 2 2020-07-17 20:58:00 C_OK
"""Create periods"""
df['Date'] = pd.to_datetime(df['Date'])
diffs = df.groupby('ID')['Date'].diff()  # gap to the previous row within the same ID (assumes rows sorted by time per ID)
laps = diffs > pd.Timedelta('2 min')
periods = laps.cumsum().apply(lambda x: 'period_{}'.format(x+1))
df['2min_period'] = periods
print(df)
ID Date Combined 2min_period
0 1 2020-07-17 20:32:00 A_OK period_1
1 1 2020-07-17 20:32:00 A_OK period_1
2 1 2020-07-17 20:32:00 A_OK period_1
3 1 2020-07-17 20:32:00 A_OK period_1
4 1 2020-07-17 20:32:00 A_FAIL period_1
5 1 2020-07-17 20:32:00 A_FAIL period_1
6 1 2020-07-17 20:34:00 B_OK period_1
7 1 2020-07-17 20:53:00 A_OK period_2
8 1 2020-07-17 20:53:00 A_OK period_2
9 1 2020-07-17 20:58:00 C_OK period_3
10 2 2020-07-17 20:32:00 A_OK period_3
11 2 2020-07-17 20:32:00 A_OK period_3
12 2 2020-07-17 20:32:00 A_OK period_3
13 2 2020-07-17 20:32:00 A_OK period_3
14 2 2020-07-17 20:32:00 A_FAIL period_3
15 2 2020-07-17 20:32:00 A_FAIL period_3
16 2 2020-07-17 20:34:00 B_OK period_3
17 2 2020-07-17 20:53:00 A_OK period_4
18 2 2020-07-17 20:53:00 A_OK period_4
19 2 2020-07-17 20:58:00 C_OK period_5
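The period labels come from a standard gap-then-cumsum trick: mark each row whose gap to the previous row exceeds the threshold, then take a cumulative sum so every marked row starts a new label. A minimal standalone sketch of just that step:

```python
import pandas as pd

times = pd.to_datetime(pd.Series(['2020-07-17 20:32',
                                  '2020-07-17 20:34',
                                  '2020-07-17 20:53',
                                  '2020-07-17 20:58']))

# True wherever the gap to the previous row exceeds 2 minutes
# (a gap of exactly 2 minutes stays in the same period);
# cumsum then turns each True into the start of a new period number
laps = times.diff() > pd.Timedelta('2 min')
periods = laps.cumsum() + 1
print(periods.tolist())  # [1, 1, 2, 3]
```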
"""finally the groupby"""
grouped = df.groupby(['ID', '2min_period'])['Combined'].apply(list).reset_index()
grouped['Combined'] = grouped.Combined.map(pd.unique)  # drop duplicates, keep first-seen order
grouped = grouped.groupby('ID')['Combined'].apply(list).reset_index()
print(grouped)
ID Combined
0 1 [[A_OK, A_FAIL, B_OK], [A_OK], [C_OK]]
1 2 [[A_OK, A_FAIL, B_OK], [A_OK], [C_OK]]