Question

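I have researched this but could not find anything that does the following. I need to:

group by a set of columns (here: studentid, subj, topic, lesson);
then find duplicated rows within a subset of columns (here: testtime, responsetime);
create a column indicating whether the row (across those 2 columns) is duplicated.

Starting dataframe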

   studentid   subj   topic  lesson  testtime    responsetime
1  1           math   add    a       timestamp1  45sec
2  1           math   add    a       timestamp1  45sec
3  1           math   add    a       timestamp2  30sec
4  1           math   add    a       timestamp3  15sec
5  1           math   add    b       timestamp1  0sec
6  1           math   add    b       timestamp1  0sec
7  1           math   add    b       timestamp1  45sec
8  1           math   add    b       timestamp1  45sec

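What I have tried: Approach 1: Creating a column that indicates whether the row is duplicated, using a user-defined function that calls duplicated. Error: 'function' object is not subscriptable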

def check_dup(list):
    return df.duplicated([list],keep='first')

df_alt['dup_values'] = df.groupby(['studentidd', 'subj','topic','lesson']). apply(check_dup['testtime','responsetime'],axis=1)

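Approach 2: Using a multi-index, but the problem is that the duplicated function looks for duplicates in the index rows rather than in a separate set of columns ('testtime', 'responsetime'):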

dfnew['dup_indicator'] = df.set_index(['studentidd', 'subj','topic','lesson']).duplicated(['testtime','responsetime'],keep=False)

Desired dataframe

   studentid   subj   topic  lesson  testtime    responsetime  dup_indicator
1  1           math   add    a       timestamp1  45sec         1
2  1           math   add    a       timestamp1  45sec         1
3  1           math   add    a       timestamp2  30sec         0
4  1           math   add    a       timestamp3  15sec         0
5  1           math   add    b       timestamp1  0sec          1
6  1           math   add    b       timestamp1  0sec          1
7  1           math   add    b       timestamp1  45sec         1
8  1           math   add    b       timestamp1  45sec         1

Answer 1

You don't need to use groupby or modify the index to do what you want. Just pass in all of the columns that you want to use to identify duplicates:

> df
   studentid   subj   topic  lesson  testtime    responsetime
1  1           math   add    a       timestamp1  45sec
2  1           math   add    a       timestamp1  45sec
3  1           math   add    a       timestamp2  30sec
4  1           math   add    a       timestamp3  15sec
5  1           math   add    b       timestamp1  0sec
6  1           math   add    b       timestamp1  0sec
7  1           math   add    b       timestamp1  45sec
8  1           math   add    b       timestamp1  45sec

> dup_cols = ['studentid', 'subj', 'topic', 'lesson', 'testtime', 'responsetime']
> df.loc[df.duplicated(subset=dup_cols, keep=False), 'dup_indicator'] = 1
> df['dup_indicator'].fillna(0, inplace=True)
> df
   studentid   subj   topic  lesson  testtime    responsetime  dup_indicator
1  1           math   add    a       timestamp1  45sec         1.0
2  1           math   add    a       timestamp1  45sec         1.0
3  1           math   add    a       timestamp2  30sec         0.0
4  1           math   add    a       timestamp3  15sec         0.0
5  1           math   add    b       timestamp1  0sec          1.0
6  1           math   add    b       timestamp1  0sec          1.0
7  1           math   add    b       timestamp1  45sec         1.0
8  1           math   add    b       timestamp1  45sec         1.0

Breaking down the steps:

Use df.duplicated to find all duplicated rows, based on the columns passed to the subset parameter; with keep=False it returns True for every row that has a duplicate.
Use .loc to create the new column dup_indicator and assign 1 to the duplicated rows.
Use fillna to assign 0 to the non-duplicated rows.
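For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same idea. It rebuilds the sample frame from the question (the timestamps are kept as placeholder strings, exactly as posted) and writes the indicator in a single step with astype(int) instead of .loc plus fillna. That is a variation on the answer, not the answer's own code: it gives integer 1/0 rather than 1.0/0.0, and it sidesteps fillna(..., inplace=True) on a selected column, which recent pandas versions may warn about or apply to a copy. The via_groupby line is an extra cross-check added here for illustration, showing that an explicit groupby gives the same result on this data.

```python
import pandas as pd

# Sample data from the question; timestamps are placeholder strings.
df = pd.DataFrame(
    {
        "studentid": [1] * 8,
        "subj": ["math"] * 8,
        "topic": ["add"] * 8,
        "lesson": ["a"] * 4 + ["b"] * 4,
        "testtime": ["timestamp1", "timestamp1", "timestamp2", "timestamp3",
                     "timestamp1", "timestamp1", "timestamp1", "timestamp1"],
        "responsetime": ["45sec", "45sec", "30sec", "15sec",
                         "0sec", "0sec", "45sec", "45sec"],
    },
    index=range(1, 9),  # 1-based row labels, as in the question
)

# Grouping columns plus the columns to compare, all passed as one subset.
dup_cols = ["studentid", "subj", "topic", "lesson", "testtime", "responsetime"]

# keep=False flags every member of a duplicated set, not only the repeats.
df["dup_indicator"] = df.duplicated(subset=dup_cols, keep=False).astype(int)

# Cross-check with an explicit groupby: a row is duplicated when its group
# (defined by the same columns) contains more than one row.
via_groupby = (df.groupby(dup_cols)["testtime"].transform("size") > 1).astype(int)

print(df)
```

Because the grouping columns are part of the subset, rows in different lessons are never matched against each other, which is what the groupby in the question was meant to achieve.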
