Pandas: remove duplicate rows where the two columns are reversed

Time: 2014-07-10 12:34:26

Tags: python pandas

I have a dataset with two columns that looks like this:

InteractorA InteractorB
AGAP028204  AGAP005846
AGAP028204  AGAP003428
AGAP028200  AGAP011124
AGAP028200  AGAP004335
AGAP028200  AGAP011356
AGAP028194  AGAP008414

I am using Pandas and want to drop rows that occur twice with the two columns swapped, i.e. go from this:

InteractorA InteractorB
AGAP002741  AGAP008026
AGAP008026  AGAP002741

to this:

InteractorA InteractorB
AGAP002741  AGAP008026

because for all intents and purposes they are the same.

Is there a built-in way to handle this?

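3 Answers:

Answer 0: (score: 5)

I ended up writing a hacky script that iterates over the rows, builds the pieces of data it needs, checks whether the pair or its reverse has already been seen, and drops the row indexes as needed.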

import pandas as pd

seen = set()          # pairs (in both orders) encountered so far
indexes_to_drop = []  # index labels of rows that repeat an earlier pair

interactions = pd.read_csv('original_interactions.txt', delimiter='\t')

for index, row in interactions.iterrows():
    pair = (row['InteractorA'], row['InteractorB'])
    pair_rev = (row['InteractorB'], row['InteractorA'])
    # Test each key separately; `(a or b) in seen` would only ever test `a`.
    if pair in seen or pair_rev in seen:
        indexes_to_drop.append(index)
    seen.add(pair)
    seen.add(pair_rev)

# iterrows() yields index labels, so they can be passed to drop() directly.
no_dups = interactions.drop(indexes_to_drop)

print(no_dups.shape)

no_dups.to_csv('no_duplicates.txt', sep='\t', index=False)

2017 edit: a few years later and with a bit more experience, here is a more elegant solution for anyone looking for something similar:

In [8]: df
Out[8]:
  InteractorA InteractorB
0  AGAP028204  AGAP005846
1  AGAP028204  AGAP003428
2  AGAP028200  AGAP011124
3  AGAP028200  AGAP004335
4  AGAP028200  AGAP011356
5  AGAP028194  AGAP008414
6  AGAP002741  AGAP008026
7  AGAP008026  AGAP002741

In [18]: df['check_string'] = df.apply(lambda row: ''.join(sorted([row['InteractorA'], row['InteractorB']])), axis=1)

In [19]: df
Out[19]:
  InteractorA InteractorB          check_string
0  AGAP028204  AGAP005846  AGAP005846AGAP028204
1  AGAP028204  AGAP003428  AGAP003428AGAP028204
2  AGAP028200  AGAP011124  AGAP011124AGAP028200
3  AGAP028200  AGAP004335  AGAP004335AGAP028200
4  AGAP028200  AGAP011356  AGAP011356AGAP028200
5  AGAP028194  AGAP008414  AGAP008414AGAP028194
6  AGAP002741  AGAP008026  AGAP002741AGAP008026
7  AGAP008026  AGAP002741  AGAP002741AGAP008026

In [20]: df.drop_duplicates('check_string')
Out[20]:
  InteractorA InteractorB          check_string
0  AGAP028204  AGAP005846  AGAP005846AGAP028204
1  AGAP028204  AGAP003428  AGAP003428AGAP028204
2  AGAP028200  AGAP011124  AGAP011124AGAP028200
3  AGAP028200  AGAP004335  AGAP004335AGAP028200
4  AGAP028200  AGAP011356  AGAP011356AGAP028200
5  AGAP028194  AGAP008414  AGAP008414AGAP028194
6  AGAP002741  AGAP008026  AGAP002741AGAP008026
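
The same idea can be sketched without a helper column or a row-wise `apply`: sort the two values within each row using NumPy, then ask pandas which sorted pairs have already appeared. This is only an illustrative variant, not part of the original answer; the names `sorted_pairs`, `is_repeat` and `deduped` are made up for the example, and the sample frame is rebuilt from the rows shown above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "InteractorA": ["AGAP028204", "AGAP028204", "AGAP028200", "AGAP028200",
                    "AGAP028200", "AGAP028194", "AGAP002741", "AGAP008026"],
    "InteractorB": ["AGAP005846", "AGAP003428", "AGAP011124", "AGAP004335",
                    "AGAP011356", "AGAP008414", "AGAP008026", "AGAP002741"],
})

# Sort the two values within each row so (A, B) and (B, A) become identical.
sorted_pairs = np.sort(df[["InteractorA", "InteractorB"]].to_numpy(), axis=1)

# True for every row whose sorted pair already appeared in an earlier row.
is_repeat = pd.DataFrame(sorted_pairs, index=df.index).duplicated()

# Keep only the first occurrence of each unordered pair.
deduped = df[~is_repeat]
print(deduped)
```

Because the comparison is made on the pair of values rather than on a concatenated string, the original two columns are left unchanged and no extra column has to be dropped afterwards.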

Answer 1: (score: 0)

I think the following will work:

In [37]:
import pandas as pd
import io
temp = """InteractorA InteractorB
AGAP028204  AGAP005846
AGAP028204  AGAP003428
AGAP028200  AGAP011124
AGAP028200  AGAP004335
AGAP028200  AGAP011356
AGAP028194  AGAP008414
AGAP002741  AGAP008026
AGAP008026  AGAP002741"""
df = pd.read_csv(io.StringIO(temp), sep=r'\s+')
df
Out[37]:
  InteractorA InteractorB
0  AGAP028204  AGAP005846
1  AGAP028204  AGAP003428
2  AGAP028200  AGAP011124
3  AGAP028200  AGAP004335
4  AGAP028200  AGAP011356
5  AGAP028194  AGAP008414
6  AGAP002741  AGAP008026
7  AGAP008026  AGAP002741

So I downloaded your data and realised I had misunderstood what you wanted; the following should work now:

# first get the values that are unique
In [72]:
df1 = df[~df.InteractorA.isin(df.InteractorB)]
df1.shape
Out[72]:
(2386, 2)

Now we want the duplicated rows, but keeping only the first value:

In [74]:

df2 = df[df.InteractorA.isin(df.InteractorB)]
df2 = df2.groupby('InteractorA').first().reset_index()
df2.shape
Out[74]:
(3074, 2)

Now concatenate the two dataframes:

In [75]:

merged = pd.concat([df1, df2], ignore_index=True)
merged.shape
Out[75]:
(5460, 2)

I think this is correct now.
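
A result like `merged` can be sanity-checked by counting how often each unordered pair occurs. The sketch below is one way to do that, assuming the column names from the question; `repeated_pairs` is a hypothetical helper written for this illustration, and the four-row frame is a subset of the sample rows above.

```python
from collections import Counter

import pandas as pd


def repeated_pairs(frame, a="InteractorA", b="InteractorB"):
    """Return the unordered (a, b) pairs that occur more than once, with counts."""
    counts = Counter(tuple(sorted(pair)) for pair in zip(frame[a], frame[b]))
    return {pair: n for pair, n in counts.items() if n > 1}


df = pd.DataFrame({
    "InteractorA": ["AGAP028204", "AGAP028194", "AGAP002741", "AGAP008026"],
    "InteractorB": ["AGAP005846", "AGAP008414", "AGAP008026", "AGAP002741"],
})

# Same steps as above, applied to the small sample.
df1 = df[~df.InteractorA.isin(df.InteractorB)]
df2 = df[df.InteractorA.isin(df.InteractorB)]
df2 = df2.groupby("InteractorA").first().reset_index()
merged = pd.concat([df1, df2], ignore_index=True)

print(repeated_pairs(df))
print(repeated_pairs(merged))
```

On these sample rows the AGAP002741/AGAP008026 pair is reported for both `df` and `merged`, so it may be worth running a check like this against the full dataset before relying on the result.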

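Answer 2: (score: 0)

This is the cleanest solution I managed to come up with for my own purposes.

Create a column in which the two values of each row are combined into a sorted list: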

df['sorted_row'] = [sorted([a,b]) for a,b in zip(df.InteractorA, df.InteractorB)]

Duplicates cannot be dropped on a column of lists, so the column has to be converted to strings:

df['sorted_row'] = df['sorted_row'].astype(str)

Drop the duplicates:

df.drop_duplicates(subset=['sorted_row'], inplace=True)
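
Lists are unhashable but tuples are not, so a possible variation on these steps is to store each sorted pair as a tuple and skip the string conversion. A minimal sketch on a subset of the sample rows from the question; `pair_key` is a hypothetical column name used only for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "InteractorA": ["AGAP028204", "AGAP028194", "AGAP002741", "AGAP008026"],
    "InteractorB": ["AGAP005846", "AGAP008414", "AGAP008026", "AGAP002741"],
})

# Store each sorted pair as a tuple; tuples are hashable, so no astype(str) is needed.
df["pair_key"] = [tuple(sorted(pair)) for pair in zip(df.InteractorA, df.InteractorB)]

# Keep the first row for each unordered pair, then discard the helper column.
result = df.drop_duplicates(subset=["pair_key"]).drop(columns="pair_key")
print(result)
```

This version returns a new frame instead of using `inplace=True`, which leaves `df` available for comparison.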