Question

I have a DataFrame and want to eliminate duplicate rows that contain the same values but in different columns:

df = pd.DataFrame(columns=['a','b','c','d'], index=['1','2','3'])
df.loc['1'] = pd.Series({'a':'x','b':'y','c':'e','d':'f'})
df.loc['2'] = pd.Series({'a':'e','b':'f','c':'x','d':'y'})
df.loc['3'] = pd.Series({'a':'w','b':'v','c':'s','d':'t'})

df
Out[8]: 
   a  b  c  d
1  x  y  e  f
2  e  f  x  y
3  w  v  s  t

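Rows [1] and [2] both contain the values {x, y, e, f}, but arranged crosswise: if you swapped columns c, d with a, b in row [2], you would get a duplicate. I want to drop such rows and keep only one of them, to get the final output: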

df_new
Out[20]: 
   a  b  c  d
1  x  y  e  f
3  w  v  s  t

How can I achieve this efficiently?

Answer 1

I think you need to filter with boolean indexing: build a mask by sorting each row with numpy.sort and calling duplicated on the result, then invert the mask with ~:

df = df[~pd.DataFrame(np.sort(df, axis=1), index=df.index).duplicated()]
print (df)
   a  b  c  d
1  x  y  e  f
3  w  v  s  t

Details:

print (np.sort(df, axis=1))
[['e' 'f' 'x' 'y']
 ['e' 'f' 'x' 'y']
 ['s' 't' 'v' 'w']]

print (pd.DataFrame(np.sort(df, axis=1), index=df.index))
   0  1  2  3
1  e  f  x  y
2  e  f  x  y
3  s  t  v  w

print (pd.DataFrame(np.sort(df, axis=1), index=df.index).duplicated())
1    False
2     True
3    False
dtype: bool

print (~pd.DataFrame(np.sort(df, axis=1), index=df.index).duplicated())

1     True
2    False
3     True
dtype: bool
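
One caveat worth illustrating: sorting the whole row treats any rearrangement of the same four values as a duplicate, not only the a, b / c, d swap described in the question. The sketch below is one possible stricter variant, assuming the columns really form the two pairs (a, b) and (c, d); the helper name canonical_pairs and the extra sample row are made up for illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [["x", "y", "e", "f"],   # row "1"
     ["e", "f", "x", "y"],   # row "2": pairs (a, b) and (c, d) swapped
     ["y", "x", "f", "e"],   # row "3": same values, but not a pair swap
     ["w", "v", "s", "t"]],  # row "4"
    columns=list("abcd"),
    index=["1", "2", "3", "4"],
)

def canonical_pairs(a, b, c, d):
    """Hypothetical helper: put the two pairs in a fixed order, keeping each pair intact."""
    left, right = (a, b), (c, d)
    return left + right if left <= right else right + left

# Strict variant: only a swap of the two pairs counts as a duplicate.
keys = pd.DataFrame(
    [canonical_pairs(*row) for row in df[["a", "b", "c", "d"]].to_numpy()],
    index=df.index,
)
strict = df[~keys.duplicated()]

# Loose variant from the answer: any permutation of the values counts.
sorted_rows = pd.DataFrame(np.sort(df.to_numpy(), axis=1), index=df.index)
loose = df[~sorted_rows.duplicated()]

print(strict)
print(loose)
```

With this sample data the strict variant keeps rows 1, 3 and 4, while the full-row sort keeps only rows 1 and 4. Which behaviour is wanted depends on how "duplicate" is meant in the question.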

Answer 2

Here is another solution, using a for loop:

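data = df.to_numpy()  # DataFrame.as_matrix() was removed in pandas 1.0
new = []

for row in data:
    # Keep a row only if none of its values occurs in a row kept so far.
    # Note: this also drops rows that share just one value with a kept row.
    if not any(c in nrow for nrow in new for c in row):
        new.append(row)
# The original index labels are not preserved here.
new_df = pd.DataFrame(new, columns=df.columns)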
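
If a loop is preferred but rows should only be dropped when all of their values match an earlier row, a possible alternative is to remember each row's sorted values in a set. This is a minimal sketch rather than the answerer's code; the names seen and keep_mask are illustrative, and it assumes the values within a row can be compared with each other (for example, all strings).

```python
import pandas as pd

df = pd.DataFrame(
    [["x", "y", "e", "f"], ["e", "f", "x", "y"], ["w", "v", "s", "t"]],
    columns=list("abcd"),
    index=["1", "2", "3"],
)

seen = set()      # sorted value tuples of the rows kept so far
keep_mask = []    # one boolean per row, True for rows to keep

for row in df.to_numpy():
    key = tuple(sorted(row))          # column order no longer matters
    keep_mask.append(key not in seen)
    seen.add(key)

new_df = df.loc[keep_mask]            # boolean mask keeps the original index labels
print(new_df)
```

Each row is checked against the set in roughly constant time, so the loop stays linear in the number of rows.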

Answer 3

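Sort each row (np.sort), then get the duplicates from the sorted result (.duplicated()). Finally, use the duplicate flags to drop (df.drop) the corresponding index labels.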

import pandas as pd
import numpy as np
df = pd.DataFrame(columns=['a','b','c','d'], index=['1','2','3'])
df.loc['1'] = pd.Series({'a':'x','b':'y','c':'e','d':'f'})
df.loc['2'] = pd.Series({'a':'e','b':'f','c':'x','d':'y'})
df.loc['3'] = pd.Series({'a':'w','b':'v','c':'s','d':'t'})

# Sort the values within each row so that rows holding the same values become identical
df_duplicated = pd.DataFrame(np.sort(df, axis=1), index=df.index).duplicated()
# Drop the rows flagged as duplicates; the first occurrence is kept
df_new = df.drop(df.index[df_duplicated])
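
Which of the matching rows survives is controlled by the keep parameter of duplicated, which accepts "first" (the default), "last", or False. The following sketch shows the two non-default choices on the same sample data; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [["x", "y", "e", "f"], ["e", "f", "x", "y"], ["w", "v", "s", "t"]],
    columns=list("abcd"),
    index=["1", "2", "3"],
)

# Row-wise sorted copy used only to detect duplicates
sorted_rows = pd.DataFrame(np.sort(df.to_numpy(), axis=1), index=df.index)

# keep="last": keep the last row of each duplicate group
keep_last = df[~sorted_rows.duplicated(keep="last")]

# keep=False: flag every member of a duplicate group, so all of them are dropped
drop_all = df[~sorted_rows.duplicated(keep=False)]

print(keep_last)
print(drop_all)
```

On this data, keep="last" leaves rows 2 and 3, and keep=False leaves only row 3.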

Pandas find Duplicates in cross values
