Question

I'm working with a DataFrame that holds some values. The problem is that it may contain duplicates.

I went through this link, but I couldn't find what I need.

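I tried using df.duplicated() to build a list of duplicates, which gives a True or False value for each index.
Then, for each index in that list where the result is True, I get the matching rows from df with df.loc[(df['id']== df['id'][dups])]. Based on this result, I call a function giveID(), which returns a list of indexes to remove from the duplicates list. Since I don't need to iterate over duplicates that are going to be removed anyway, is it possible to delete those indexes from the duplicates list inside the for loop without breaking everything?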

Here is a sample of my df (duplicates are based on the id column):

   | id | type
--------------
0  | 312| data2
1  | 334| data
2  | 22 | data1
3  | 312| data8
#Here 0 and 3 are duplicates based on ID
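
For reference, the sample above can be rebuilt as a small DataFrame to see what df.duplicated() marks when keep=False is used. This is only a minimal sketch, assuming the column names id and type shown in the table:

```python
import pandas as pd

# Sample frame rebuilt from the table above (column names taken from the table).
df = pd.DataFrame({
    "id": [312, 334, 22, 312],
    "type": ["data2", "data", "data1", "data8"],
})

# keep=False marks every member of a duplicated group, not only the later ones.
mask = df.duplicated(subset="id", keep=False)

# Index labels of the rows whose id occurs more than once.
dup_index = mask[mask].index.tolist()
print(dup_index)
```

With this sample, rows 0 and 3 are the only ones marked, since they share id 312.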

Here is part of my code:

duplicates = df.duplicated(subset='column_name',keep=False)
duplicates = duplicates[duplicates]


df_dup = df
listidx = []
i=0
for dups in duplicates.index:

    dup_id = df.loc[(df['id']== df['id'][dups])]
    for a in giveID(dup_id):
        if a not in listidx:
            listidx.append(a)

#here I want to delete all of listidx from duplicates inside the for loop
#so that I don't iterate over unnecessary duplicates

def giveID(id):
    #some code that returns a list of indexes
    ...

This is what duplicates looks like in my code:

0          True
1          True
582        True
583        True
605        True
606        True
622        True
623        True
624        True
625        True
626        True
627        True
628        True
629        True
630        True
631        True
           ... 
1990368    True
1991030    True

I'd like to get the same thing, but without the unnecessary duplicates.
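
One possible reading of "the same thing without the unnecessary duplicates" is a boolean Series that keeps a single representative row per duplicated id. The sketch below is an assumption about the intended result, not something stated in the question; it reuses the id column from the sample and combines two calls to df.duplicated():

```python
import pandas as pd

df = pd.DataFrame({
    "id": [312, 334, 22, 312],
    "type": ["data2", "data", "data1", "data8"],
})

# Every row whose id appears more than once.
in_dup_group = df.duplicated(subset="id", keep=False)

# Rows that repeat an id already seen earlier in the frame.
is_repeat = df.duplicated(subset="id", keep="first")

# First occurrence of each duplicated id: in a group, but not a repeat.
representatives = in_dup_group & ~is_repeat
rep_index = representatives[representatives].index.tolist()
print(rep_index)
```

Iterating over rep_index instead of the full duplicates index would visit each duplicated id exactly once.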

Answer 1

If you need the indexes of the non-duplicated IDs:

import pandas as pd

df = pd.DataFrame({'ID':[0,1,1,3], 'B':[0,1,2,3]})
   B  ID
0  0   0
1  1   1
2  2   1
3  3   3

# List of indexes
non_duplicated = df.drop_duplicates(subset='ID', keep=False).index

df.loc[df.index.isin(non_duplicated)]
   B  ID
0  0   0
3  3   3
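
If the goal is instead to run giveID() once per group of duplicated IDs, one option is to leave the list being iterated untouched and keep a separate set of index labels that have already been handled, skipping them in the loop. The following is only a sketch: the giveID() below is a hypothetical stand-in (the question does not show its body) that returns every index label of the group, and the sample data is made up for illustration.

```python
import pandas as pd

# Made-up sample data with two duplicated ids (312 and 334).
df = pd.DataFrame({
    "id": [312, 334, 22, 312, 334, 312],
    "type": ["data2", "data", "data1", "data8", "data3", "data5"],
})

def giveID(group):
    # Hypothetical stand-in: the real function is not shown in the question.
    return list(group.index)

mask = df.duplicated(subset="id", keep=False)
duplicates = mask[mask]

handled = set()   # index labels that belong to an already processed group
listidx = []      # collected results of giveID, in processing order
calls = 0         # counts how often giveID actually runs

for dups in duplicates.index:
    if dups in handled:
        continue  # this row's group was processed on an earlier iteration
    dup_rows = df.loc[df["id"] == df.loc[dups, "id"]]
    calls += 1
    listidx.extend(giveID(dup_rows))
    # Mark the whole group as handled, whatever giveID returned.
    handled.update(dup_rows.index)

print(listidx, calls)
```

Skipping with a set avoids mutating duplicates while looping over it, and the membership test stays cheap even for a large frame. Iterating over df[mask].groupby('id') would be another way to visit each duplicated id only once.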
