I have a DataFrame like the one below:
A B
chair bed
bed chair
spoon knife
plate cup
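So rows 1 and 2 are duplicates of each other as far as I'm concerned, and I'd like to remove both of them. How can I do this in a simple way?
After removing the duplicates I would have: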
A B
spoon knife
plate cup
Thanks.
Answer 0 (score: 2)
Use boolean indexing with the mask inverted by ~:
df = df[~pd.DataFrame(np.sort(df[['A','B']], axis=1)).duplicated(keep=False)]
Another, slower solution:
df = df[~df[['A','B']].apply(sorted, axis=1).duplicated(keep=False)]
print (df)
A B
2 spoon knife
3 plate cup
Details:
print (pd.DataFrame(np.sort(df[['A','B']], axis=1)))
0 1
0 bed chair
1 bed chair
2 knife spoon
3 cup plate
print (pd.DataFrame(np.sort(df[['A','B']], axis=1)).duplicated(keep=False))
0 True
1 True
2 False
3 False
dtype: bool
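The steps above can be put together as a self-contained sketch. One assumption goes beyond the answer: the frame here is given a non-default index (10 to 13) purely for illustration, and the mask is converted to a plain NumPy array so that it is applied by position. The mask built in the answer carries a fresh 0..n-1 index, so it lines up with df only when df has the default index. The names sorted_pairs, mask and result are illustrative.

```python
import numpy as np
import pandas as pd

# Sample data from the question; the custom index is an assumption for illustration
df = pd.DataFrame({'A': ['chair', 'bed', 'spoon', 'plate'],
                   'B': ['bed', 'chair', 'knife', 'cup']},
                  index=[10, 11, 12, 13])

# Sort the values within each row so that ('chair', 'bed') and
# ('bed', 'chair') become the same pair
sorted_pairs = pd.DataFrame(np.sort(df[['A', 'B']].to_numpy(), axis=1))

# keep=False marks every member of a duplicated group, not just the repeats
mask = sorted_pairs.duplicated(keep=False).to_numpy()

# A boolean NumPy array is applied positionally, so the index of df does not matter
result = df[~mask]
print(result)
```

With the default index the one-liner from the answer should behave the same way; the conversion only matters once the index has been changed, for example after filtering.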
Timings:
df = pd.concat([df] * 10000, ignore_index=True)
In [441]: %%timeit
...: df[~pd.DataFrame(np.sort(df[['A','B']], axis=1)).duplicated(keep=False)]
...:
100 loops, best of 3: 9.38 ms per loop
In [442]: %%timeit
...: df[~df[['A','B']].apply(sorted, axis=1).duplicated(keep=False)]
...:
1 loop, best of 3: 4.46 s per loop
#jpp solution
In [443]: %%timeit
...: df['C'] = list(map(frozenset, df[['A', 'B']].values.tolist()))
...: df.drop_duplicates('C', keep=False).drop('C', axis=1)
...:
10 loops, best of 3: 28.4 ms per loop
Answer 1 (score: 1)
Here is one way using frozenset:
df['C'] = list(map(frozenset, df[['A', 'B']].values.tolist()))
df = df.drop_duplicates('C', keep=False).drop('C', axis=1)
Result:
A B
2 spoon knife
3 plate cup
Explanation:
Build a column 'C' of frozenset values from 'A' and 'B'. Drop all duplicates on 'C' with keep=False, then remove column 'C'. frozenset is required rather than set, because sets are not hashable.
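A small sketch of why frozenset works as a key here: it ignores order, and unlike set it can be hashed. The last part is a variant rather than the answer's exact code: it keeps the keys in a separate Series (the hypothetical name keys) instead of adding a temporary column 'C', so df is not modified.

```python
import pandas as pd

df = pd.DataFrame({'A': ['chair', 'bed', 'spoon', 'plate'],
                   'B': ['bed', 'chair', 'knife', 'cup']})

# frozenset ignores order, so both orderings produce an equal, equally hashed key
pair_ab = frozenset(['chair', 'bed'])
pair_ba = frozenset(['bed', 'chair'])
same_key = (pair_ab == pair_ba) and (hash(pair_ab) == hash(pair_ba))

# A plain set is mutable and therefore cannot be hashed
try:
    hash({'chair', 'bed'})
    set_is_hashable = True
except TypeError:
    set_is_hashable = False

# Variant (an assumption, not the answer's code): keep the keys in a Series
# that shares the index of df, and filter with an inverted duplicate mask
keys = pd.Series([frozenset(row) for row in df[['A', 'B']].values.tolist()],
                 index=df.index)
result = df[~keys.duplicated(keep=False)]
print(result)
```

Because keys shares the index of df, the mask aligns by label, and no column has to be added and dropped again.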