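I have researched this but could not find anything that does the following. I need to add a column that flags rows whose testtime and responsetime are duplicated.
Starting dataframe: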
studentid subj topic lesson testtime responsetime
1 1 math add a timestamp1 45sec
2 1 math add a timestamp1 45sec
3 1 math add a timestamp2 30sec
4 1 math add a timestamp3 15sec
5 1 math add b timestamp1 0sec
6 1 math add b timestamp1 0sec
7 1 math add b timestamp1 45sec
8 1 math add b timestamp1 45sec
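What I have tried: Approach 1: a user-defined function built on duplicated, used to create a column indicating whether a row is a duplicate. Error: 'function' object is not subscriptable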
def check_dup(list):
    return df.duplicated([list],keep='first')
df_alt['dup_values'] = df.groupby(['studentidd', 'subj','topic','lesson']). apply(check_dup['testtime','responsetime'],axis=1)
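Approach 2: a MultiIndex, but the problem is that duplicated then looks for duplicates in the index rows rather than in a separate set of columns ('testtime','responsetime'):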
dfnew['dup_indicator'] = df.set_index(['studentidd', 'subj','topic','lesson']).duplicated(['testtime','responsetime'],keep=False)
Desired dataframe:
studentid subj topic lesson testtime responsetime dup_indicator
1 1 math add a timestamp1 45sec 1
2 1 math add a timestamp1 45sec 1
3 1 math add a timestamp2 30sec 0
4 1 math add a timestamp3 15sec 0
5 1 math add b timestamp1 0sec 1
6 1 math add b timestamp1 0sec 1
7 1 math add b timestamp1 45sec 1
8 1 math add b timestamp1 45sec 1
Answer 0 (score: 0)
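You do not need groupby or a modified index to do what you want. Just pass every column that should identify a duplicate to the subset argument of df.duplicated.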
Breaking down the steps:
> df
studentid subj topic lesson testtime responsetime
1 1 math add a timestamp1 45sec
2 1 math add a timestamp1 45sec
3 1 math add a timestamp2 30sec
4 1 math add a timestamp3 15sec
5 1 math add b timestamp1 0sec
6 1 math add b timestamp1 0sec
7 1 math add b timestamp1 45sec
8 1 math add b timestamp1 45sec
> dup_cols = ['studentid', 'subj', 'topic', 'lesson', 'testtime', 'responsetime']
> df.loc[df.duplicated(subset=dup_cols, keep=False), 'dup_indicator'] = 1
> df['dup_indicator'] = df['dup_indicator'].fillna(0)
> df
studentid subj topic lesson testtime responsetime dup_indicator
1 1 math add a timestamp1 45sec 1.0
2 1 math add a timestamp1 45sec 1.0
3 1 math add a timestamp2 30sec 0.0
4 1 math add a timestamp3 15sec 0.0
5 1 math add b timestamp1 0sec 1.0
6 1 math add b timestamp1 0sec 1.0
7 1 math add b timestamp1 45sec 1.0
8 1 math add b timestamp1 45sec 1.0
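df.duplicated returns True for every row that is duplicated on the columns passed to the subset argument; with keep=False it marks all copies rather than all but the first.
.loc selects those duplicated rows and assigns 1 to dup_indicator.
fillna assigns 0 to the non-duplicate rows.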
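
As a more compact variant of the steps above, the boolean mask from df.duplicated can be cast straight to integers, which avoids the float 1.0/0.0 values and the separate fillna step. This is only a sketch: the DataFrame below is rebuilt by hand from the sample in the question, and the timestamp strings are placeholders rather than real datetime values.

```python
import pandas as pd

# Sample data rebuilt from the question; "timestamp1" etc. are placeholder strings.
df = pd.DataFrame({
    "studentid": [1] * 8,
    "subj": ["math"] * 8,
    "topic": ["add"] * 8,
    "lesson": ["a"] * 4 + ["b"] * 4,
    "testtime": ["timestamp1", "timestamp1", "timestamp2", "timestamp3",
                 "timestamp1", "timestamp1", "timestamp1", "timestamp1"],
    "responsetime": ["45sec", "45sec", "30sec", "15sec",
                     "0sec", "0sec", "45sec", "45sec"],
})

# Every column that must match for two rows to count as duplicates.
dup_cols = ["studentid", "subj", "topic", "lesson", "testtime", "responsetime"]

# keep=False marks every member of a duplicate group, not only the repeats;
# astype(int) turns True/False into 1/0.
df["dup_indicator"] = df.duplicated(subset=dup_cols, keep=False).astype(int)

print(df["dup_indicator"].tolist())  # [1, 1, 0, 0, 1, 1, 1, 1]
```

With keep='first' instead, only the second and later copies in each group would be flagged, which does not match the desired output where both rows of a pair get 1.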