I have a dataset:
id url keep_anyway field
1 A.com Yes X
2 A.com Yes Y
3 B.com No Y
4 B.com No X
5 C.com No X
I want to remove rows with a duplicated "url", subject to two conditions.
The expected output is:
id url keep_anyway field
1 A.com Yes X
2 A.com Yes Y
4 B.com No X
5 C.com No X
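I was able to handle condition 1 with:
df.loc[(df['keep_anyway'] == 'Yes') | ~df['url'].duplicated()]
But how do I set up condition 2?
Note that the only possible values of the "field" column are X and Y, and whenever a url is duplicated, I know I have exactly one "X" row and one "Y" row.
I thought I could sort the "field" column from A to Z and then pass "keep_first"=True to df.duplicated, but I think that has been deprecated, hasn't it?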
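For reference, the sorting idea can be sketched as below. This is only a sketch under assumptions: the DataFrame is rebuilt by hand from the sample table, and condition 2 is read as "among the remaining duplicates, keep the row whose field is X", which is what the expected output suggests. In current pandas the argument to `duplicated` is `keep` (`'first'` by default, or `'last'` or `False`); there is no `keep_first` argument.

```python
import pandas as pd

# Sample data rebuilt by hand from the table in the question
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "url": ["A.com", "A.com", "B.com", "B.com", "C.com"],
    "keep_anyway": ["Yes", "Yes", "No", "No", "No"],
    "field": ["X", "Y", "Y", "X", "X"],
})

# Sort so that "X" rows come before "Y" rows; mergesort is a stable sort
by_field = df.sort_values("field", kind="mergesort")

# With the default keep='first', only the later rows of each url are flagged,
# and after the sort the "X" row is always the first one for its url
later_occurrence = by_field.duplicated(subset="url")

# Keep rows flagged keep_anyway, plus the first (i.e. "X") row of every url
mask = (by_field["keep_anyway"] == "Yes") | ~later_occurrence
result = by_field[mask].sort_index()  # restore the original row order
print(result)
```

The final `sort_index()` is only there to put the rows back in their original order; drop it if the order does not matter.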
Answer 0 (score: 2)
Try this:
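import numpy as np
# keep=False flags every row whose url is repeated, not only the later occurrences
duplicates = df.duplicated(subset='url', keep=False)
keep_anyway_bool = df['keep_anyway'] == 'Yes'  # (credit @acushner for pointing this out)
field_bool = df['field'] == 'X'  # (credit @acushner for pointing this out)
# Keep rows with a unique url, rows flagged keep_anyway, and rows whose field is X
df[np.invert(duplicates) | keep_anyway_bool | field_bool]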
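To check the idea against the sample data, here is a self-contained sketch. The DataFrame is rebuilt by hand from the table in the question, and the names `shares_url` and `result` are mine. One detail matters: with the default `keep='first'`, `duplicated` leaves the first row of each url unflagged, so the first B.com row (field Y) would pass the "not a duplicate" test. Passing `keep=False` flags every row of a repeated url instead.

```python
import pandas as pd

# Sample data rebuilt by hand from the table in the question
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "url": ["A.com", "A.com", "B.com", "B.com", "C.com"],
    "keep_anyway": ["Yes", "Yes", "No", "No", "No"],
    "field": ["X", "Y", "Y", "X", "X"],
})

# True for every row whose url appears more than once
shares_url = df.duplicated(subset="url", keep=False)
keep_anyway_bool = df["keep_anyway"] == "Yes"
field_bool = df["field"] == "X"

# A row survives if its url is unique, it is flagged keep_anyway, or its field is X
result = df[~shares_url | keep_anyway_bool | field_bool]
print(result)
```

On this data the result contains ids 1, 2, 4 and 5, which matches the expected output in the question. The `~shares_url` term is what keeps a row with a unique url even when its field is Y.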