I have a dataframe like the following:
ID time Item Status Combined
1 4/29/20 20:32 A OK A_OK
1 4/29/20 20:32 A OK A_OK
1 4/29/20 20:32 A OK A_OK
1 4/29/20 20:32 A OK A_OK
1 4/29/20 20:32 A FAIL A_FAIL
1 4/29/20 20:32 A FAIL A_FAIL
1 4/29/20 20:34 B OK B_OK
1 4/29/20 20:53 A OK A_OK
1 4/29/20 20:53 A OK A_OK
1 4/29/20 20:58 C OK C_OK
2 5/30/20 22:32 A OK A_OK
2 5/30/20 22:32 A OK A_OK
2 5/30/20 22:32 A OK A_OK
2 5/30/20 22:32 A FAIL A_FAIL
2 5/30/20 22:32 B OK B_OK
2 5/30/20 22:32 B OK B_OK
2 4/29/20 20:53 A OK A_OK
2 4/29/20 20:53 C FAIL C_FAIL
2 4/29/20 20:53 C FAIL C_FAIL
2 4/29/20 20:58 D OK D_OK
Every unique item is in the Combined column.
For each unique ID, I want to get, in order:
1 [[A_OK], [A_FAIL, B_OK], [A_OK], [C_OK]]
2 [[A_OK], [A_FAIL,B_OK], [A_OK, C_FAIL], [D_OK]]
If not the above, the following would also work as a txt file, where each line represents one ID, -1 marks the end of an item set, and -2 marks the end of that ID's line.
A_OK -1 A_FAIL B_OK -1 A_OK -1 C_OK -1 -2
A_OK -1 A_FAIL B_OK -1 A_OK C_FAIL -1 D_OK -1 -2
If items in "Combined" fall within a 2-minute time window, they belong to the same item set (the same sublist); otherwise they count as another item set for the same ID.
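For reference, the sentinel format above can be produced from nested per-ID lists with a few lines of Python. This is only an illustrative sketch; the helper name `to_sentinel_line` is mine, not from the question:

```python
def to_sentinel_line(item_sets):
    """Join one ID's item sets into a single line, closing each
    item set with -1 and the whole line with -2."""
    parts = []
    for item_set in item_sets:
        parts.extend(item_set)
        parts.append('-1')
    parts.append('-2')
    return ' '.join(parts)

line = to_sentinel_line([['A_OK'], ['A_FAIL', 'B_OK'], ['A_OK'], ['C_OK']])
print(line)  # A_OK -1 A_FAIL B_OK -1 A_OK -1 C_OK -1 -2
```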
Answer 0 (score: 0)
Not sure I fully understand, but you can try apply(list). For example:
import pandas as pd
df = pd.DataFrame({'ID': ['1', '1', '1', '2', '2', '2'],
                   'Combined': ['A_OK', 'A_FAIL', 'B_OK', 'A_FAIL', 'B_FAIL', 'C_OK']})
df
Out[3]:
ID Combined
0 1 A_OK
1 1 A_FAIL
2 1 B_OK
3 2 A_FAIL
4 2 B_FAIL
5 2 C_OK
df.groupby('ID')['Combined'].apply(list)
Out[4]:
ID
1 [A_OK, A_FAIL, B_OK]
2 [A_FAIL, B_FAIL, C_OK]
Name: Combined, dtype: object
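Note that apply(list) keeps every duplicate row. If repeats should be dropped while preserving first-seen order (as the question's expected output suggests), pd.unique can be applied per group. A small sketch, with made-up data for illustration:

```python
import pandas as pd

df = pd.DataFrame({'ID': ['1', '1', '1', '1'],
                   'Combined': ['A_OK', 'A_OK', 'A_FAIL', 'A_OK']})

# pd.unique preserves order of first appearance, unlike set()
deduped = df.groupby('ID')['Combined'].apply(lambda s: list(pd.unique(s)))
print(deduped.loc['1'])  # ['A_OK', 'A_FAIL']
```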
Answer 1 (score: 0)
OK, so I understand you want this, rather than what you just wrote:
1 [[A_OK, A_FAIL, B_OK], [A_OK], [C_OK]]
2 [[A_OK, A_FAIL, B_OK], [A_OK, C_FAIL], [D_OK]]
To do that:
import pandas as pd
from datetime import datetime
"""Create the df"""
fmt = '%d-%m-%Y %H:%M'
d1 = datetime.strptime('17-07-2020 20:32', fmt)
d2 = datetime.strptime('17-07-2020 20:34', fmt)
d3 = datetime.strptime('17-07-2020 20:53', fmt)
d4 = datetime.strptime('17-07-2020 20:58', fmt)
df = pd.DataFrame({
    'ID': ['1'] * 10 + ['2'] * 10,
    'Date': [d1, d1, d1, d1, d1, d1, d2, d3, d3, d4] * 2,
    'Combined': ['A_OK', 'A_OK', 'A_OK', 'A_OK', 'A_FAIL', 'A_FAIL',
                 'B_OK', 'A_OK', 'A_OK', 'C_OK'] * 2,
})
print(df)
ID Date Combined
0 1 2020-07-17 20:32:00 A_OK
1 1 2020-07-17 20:32:00 A_OK
2 1 2020-07-17 20:32:00 A_OK
3 1 2020-07-17 20:32:00 A_OK
4 1 2020-07-17 20:32:00 A_FAIL
5 1 2020-07-17 20:32:00 A_FAIL
6 1 2020-07-17 20:34:00 B_OK
7 1 2020-07-17 20:53:00 A_OK
8 1 2020-07-17 20:53:00 A_OK
9 1 2020-07-17 20:58:00 C_OK
10 2 2020-07-17 20:32:00 A_OK
11 2 2020-07-17 20:32:00 A_OK
12 2 2020-07-17 20:32:00 A_OK
13 2 2020-07-17 20:32:00 A_OK
14 2 2020-07-17 20:32:00 A_FAIL
15 2 2020-07-17 20:32:00 A_FAIL
16 2 2020-07-17 20:34:00 B_OK
17 2 2020-07-17 20:53:00 A_OK
18 2 2020-07-17 20:53:00 A_OK
19 2 2020-07-17 20:58:00 C_OK
"""Create periods"""
df['Date'] = pd.to_datetime(df['Date'])
diffs = df.groupby('ID')['Date'].diff()  # gap to the previous row within the same ID (assumes rows sorted by time per ID)
laps = diffs > pd.Timedelta('2 min')
periods = laps.cumsum().apply(lambda x: 'period_{}'.format(x+1))
df['2min_period'] = periods
print(df)
ID Date Combined 2min_period
0 1 2020-07-17 20:32:00 A_OK period_1
1 1 2020-07-17 20:32:00 A_OK period_1
2 1 2020-07-17 20:32:00 A_OK period_1
3 1 2020-07-17 20:32:00 A_OK period_1
4 1 2020-07-17 20:32:00 A_FAIL period_1
5 1 2020-07-17 20:32:00 A_FAIL period_1
6 1 2020-07-17 20:34:00 B_OK period_1
7 1 2020-07-17 20:53:00 A_OK period_2
8 1 2020-07-17 20:53:00 A_OK period_2
9 1 2020-07-17 20:58:00 C_OK period_3
10 2 2020-07-17 20:32:00 A_OK period_3
11 2 2020-07-17 20:32:00 A_OK period_3
12 2 2020-07-17 20:32:00 A_OK period_3
13 2 2020-07-17 20:32:00 A_OK period_3
14 2 2020-07-17 20:32:00 A_FAIL period_3
15 2 2020-07-17 20:32:00 A_FAIL period_3
16 2 2020-07-17 20:34:00 B_OK period_3
17 2 2020-07-17 20:53:00 A_OK period_4
18 2 2020-07-17 20:53:00 A_OK period_4
19 2 2020-07-17 20:58:00 C_OK period_5
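The period labels come from a standard gap-then-cumsum trick: mark each row whose gap to the previous row exceeds the threshold, then take a cumulative sum so every marked row starts a new label. A minimal standalone sketch of just that step:

```python
import pandas as pd

times = pd.to_datetime(pd.Series(['2020-07-17 20:32',
                                  '2020-07-17 20:34',
                                  '2020-07-17 20:53',
                                  '2020-07-17 20:58']))

# True wherever the gap to the previous row exceeds 2 minutes
# (a gap of exactly 2 minutes stays in the same period);
# cumsum then turns each True into the start of a new period number
laps = times.diff() > pd.Timedelta('2 min')
periods = laps.cumsum() + 1
print(periods.tolist())  # [1, 1, 2, 3]
```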
"""finally the groupby"""
grouped = df.groupby(['ID', '2min_period'])['Combined'].apply(list).reset_index()
grouped['Combined'] = grouped.Combined.map(pd.unique)  # drop duplicates, keep first-seen order
grouped = grouped.groupby('ID')['Combined'].apply(list).reset_index()
print(grouped)
ID Combined
0 1 [[A_OK, A_FAIL, B_OK], [A_OK], [C_OK]]
1 2 [[A_OK, A_FAIL, B_OK], [A_OK], [C_OK]]