我有一个包含2列的数据集,如下所示......
InteractorA InteractorB
AGAP028204 AGAP005846
AGAP028204 AGAP003428
AGAP028200 AGAP011124
AGAP028200 AGAP004335
AGAP028200 AGAP011356
AGAP028194 AGAP008414
我正在使用Pandas,我想删除两次出现的行,但只是按照以下方式反转...从这...
InteractorA InteractorB
AGAP002741 AGAP008026
AGAP008026 AGAP002741
对此...
InteractorA InteractorB
AGAP002741 AGAP008026
因为它们对于所有意图和目的都是一样的。
是否有内置方法来处理这个问题?
答案 0 :(得分:5)
我最终创建了一个hacky脚本,它遍历行和必要的数据片段,并检查是否出现连接或是否出现反向,并根据需要删除行索引。
import pandas as pd
checklist = []
indexes_to_drop = []
interactions = pd.read_csv('original_interactions.txt', delimiter = '\t')
for index, row in interactions.iterrows():
check_string = row['InteractorA'] + row['InteractorB']
check_string_rev = row['InteractorB'] + row['InteractorA']
if (check_string or check_string_rev) in checklist:
indexes_to_drop.append(index)
else:
pass
checklist.append(check_string)
checklist.append(check_string_rev)
no_dups = interactions.drop(interactions.index[indexes_to_drop])
print no_dups.shape
no_dups.to_csv('no_duplicates.txt',sep='\t',index = False)
2017年编辑:几年后,凭借更多的经验,对于寻找类似内容的人来说,这是一个更优雅的解决方案:
In [8]: df
Out[8]:
InteractorA InteractorB
0 AGAP028204 AGAP005846
1 AGAP028204 AGAP003428
2 AGAP028200 AGAP011124
3 AGAP028200 AGAP004335
4 AGAP028200 AGAP011356
5 AGAP028194 AGAP008414
6 AGAP002741 AGAP008026
7 AGAP008026 AGAP002741
In [18]: df['check_string'] = df.apply(lambda row: ''.join(sorted([row['InteractorA'], row['InteractorB']])), axis=1)
In [19]: df
Out[19]:
InteractorA InteractorB check_string
0 AGAP028204 AGAP005846 AGAP005846AGAP028204
1 AGAP028204 AGAP003428 AGAP003428AGAP028204
2 AGAP028200 AGAP011124 AGAP011124AGAP028200
3 AGAP028200 AGAP004335 AGAP004335AGAP028200
4 AGAP028200 AGAP011356 AGAP011356AGAP028200
5 AGAP028194 AGAP008414 AGAP008414AGAP028194
6 AGAP002741 AGAP008026 AGAP002741AGAP008026
7 AGAP008026 AGAP002741 AGAP002741AGAP008026
In [20]: df.drop_duplicates('check_string')
Out[20]:
InteractorA InteractorB check_string
0 AGAP028204 AGAP005846 AGAP005846AGAP028204
1 AGAP028204 AGAP003428 AGAP003428AGAP028204
2 AGAP028200 AGAP011124 AGAP011124AGAP028200
3 AGAP028200 AGAP004335 AGAP004335AGAP028200
4 AGAP028200 AGAP011356 AGAP011356AGAP028200
5 AGAP028194 AGAP008414 AGAP008414AGAP028194
6 AGAP002741 AGAP008026 AGAP002741AGAP008026
答案 1 :(得分:0)
我认为以下内容可行:
In [37]:
import pandas as pd
import io
temp = """InteractorA InteractorB
AGAP028204 AGAP005846
AGAP028204 AGAP003428
AGAP028200 AGAP011124
AGAP028200 AGAP004335
AGAP028200 AGAP011356
AGAP028194 AGAP008414
AGAP002741 AGAP008026
AGAP008026 AGAP002741"""
df = pd.read_csv(io.StringIO(temp), sep='\s+')
df
Out[37]:
InteractorA InteractorB
0 AGAP028204 AGAP005846
1 AGAP028204 AGAP003428
2 AGAP028200 AGAP011124
3 AGAP028200 AGAP004335
4 AGAP028200 AGAP011356
5 AGAP028194 AGAP008414
6 AGAP002741 AGAP008026
7 AGAP008026 AGAP002741
所以我下载了你的数据并误解了你想要的东西,所以现在可以使用以下内容:
# first get the values that are unique
In [72]:
df1 = df[~df.InteractorA.isin(df.InteractorB)]
df1.shape
Out[72]:
(2386, 2)
现在我们想要获取重复的行但是取第一个值:
In [74]:
df2 = df[df.InteractorA.isin(df.InteractorB)]
df2 = df2.groupby('InteractorA').first().reset_index()
df2.shape
Out[74]:
(3074, 2)
现在连接2个数据帧:
In [75]:
merged = pd.concat([df1, df2], ignore_index=True)
merged.shape
Out[75]:
(5460, 2)
我认为现在这是正确的。
答案 2 :(得分:0)
这是我设法为自己的目的而努力的最干净的解决方案。
创建一个列,其中每一行都在排序列表中组合
df['sorted_row'] = [sorted([a,b]) for a,b in zip(df.InteractorA, df.InteractorB)]
不能删除列表上的重复项,以便该列应为字符串
df['sorted_row'] = df['sorted_row'].astype(str)
删除重复项
df.drop_duplicates(subset=['sorted_row'], inplace=True)