I have a file from which I need to remove the duplicate pairs (marked in bold).
Input file:
AT1G01010 = 0005634
**AT1G01010 = 0006355**
AT1G01010 = 0003677
AT1G01010 = 0007275
**AT1G01010 = 0006355
AT1G01010 = 0006355**
AT1G01010 = 0006888
**AT1G01020 = 0016125**
AT1G01020 = 0016020
**AT1G01020 = 0005739**
**AT1G01020 = 0016125**
AT1G01020 = 0003674
AT1G01020 = 0005783
**AT1G01020 = 0005739**
**AT1G01020 = 0006665
AT1G01020 = 0006665**
Expected output:
AT1G01010 = 0005634
AT1G01010 = 0006355
AT1G01010 = 0003677
AT1G01010 = 0007275
AT1G01010 = 0006888
AT1G01020 = 0016125
AT1G01020 = 0016020
AT1G01020 = 0005739
AT1G01020 = 0003674
AT1G01020 = 0005783
AT1G01020 = 0006665
So, to remove the duplicates, I first built a dictionary. After creating the dictionary, I tried this code:
import sys
ara_go_file = open(sys.argv[1]).readlines()
ara_id_list = []
ara_go_list = []
for lines in ara_go_file:
    split_lines = lines.split(' ')
    ara_id = split_lines[0]
    ara_id_list.append(ara_id)
    go_id_split = split_lines[-1]
    go_id = go_id_split.split('\n')[0]
    ara_go_list.append(go_id)
ara_id_go_dic = dict(zip(ara_id_list, ara_go_list))  # ara_id_go_dic (this is the name of the dict I have created)
new_dict = {}  # made a new dict to copy the data into and remove the duplicate pairs
for k in ara_id_go_dic.items():
    if k[0] in new_dict:
        if k[1] not in new_dict[k[0]]:
            new_dict[k[0]].append(k[1])
    else:
        new_dict[k[0]] = [k[1]]
print new_dict
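I don't know exactly what mistake I am making.
Please point out my mistake, or let me know if there is another way to remove the duplicate pairs.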
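Answer 0 (score: 2)
You can use a set to remove the duplicate elements (note that a set does not preserve the original line order):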
>>> s="""AT1G01010 = 0005634
... AT1G01010 = 0006355
... AT1G01010 = 0003677
... AT1G01010 = 0007275
... AT1G01010 = 0006355
... AT1G01010 = 0006355
... AT1G01010 = 0006888
... AT1G01020 = 0016125
... AT1G01020 = 0016020
... AT1G01020 = 0005739
... AT1G01020 = 0016125
... AT1G01020 = 0003674
... AT1G01020 = 0005783
... AT1G01020 = 0005739
... AT1G01020 = 0006665
... AT1G01020 = 0006665"""
>>> for j in set(s.split('\n')):
...     print(j)
...
AT1G01010 = 0005634
AT1G01020 = 0016020
AT1G01010 = 0007275
AT1G01010 = 0006355
AT1G01020 = 0006665
AT1G01010 = 0003677
AT1G01020 = 0005783
AT1G01020 = 0016125
AT1G01020 = 0005739
AT1G01020 = 0003674
AT1G01010 = 0006888
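The expected output in the question keeps each pair at the position of its first occurrence, while iterating over a plain set yields an arbitrary order. A minimal sketch of an order-preserving variant follows; the helper name `dedupe_lines` and the shortened `sample` string are illustrative assumptions, not part of the original answer.

```python
def dedupe_lines(lines):
    """Return lines without duplicates, keeping first-occurrence order.

    dedupe_lines is a hypothetical helper name used only for this sketch.
    """
    seen = set()    # pairs already emitted
    unique = []     # result, in original order
    for line in lines:
        line = line.strip()
        # Skip blank lines and anything we have already seen.
        if not line or line in seen:
            continue
        seen.add(line)
        unique.append(line)
    return unique


# Shortened sample taken from the question's input.
sample = """AT1G01010 = 0005634
AT1G01010 = 0006355
AT1G01010 = 0003677
AT1G01010 = 0006355
AT1G01020 = 0016125
AT1G01020 = 0016125"""

for line in dedupe_lines(sample.split("\n")):
    print(line)
```

For a real file, the same function could be fed an open file object instead of `sample.split("\n")`, since it only needs an iterable of strings.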
Answer 1 (score: 0)
Use the csv module and a set:
**Input:**
Same as given in the question.
**Demo:**
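import csv
p = "dp-input.txt"
result = set()  # a set keeps each (ID, GO term) pair only once
# Python 2 opens csv files in binary mode; on Python 3 use open(p, newline="")
with open(p, "rb") as fp:
    root = csv.reader(fp, delimiter='=')
    for row in root:
        result.add((row[0], row[1]))
p1 = "dp-output.txt"
# On Python 3 use open(p1, "w", newline="") instead of "wb"
with open(p1, "wb") as fp:
    root = csv.writer(fp, delimiter='=')
    root.writerows(result)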
**Output:**
AT1G01010 = 0006888
AT1G01020 = 0016020
AT1G01020 = 0005739
AT1G01010 = 0007275
AT1G01020 = 0003674
AT1G01020 = 0016125
AT1G01020 = 0005783
AT1G01020 = 0006665
AT1G01010 = 0003677
AT1G01010 = 0005634
AT1G01010 = 0006355
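If the goal is the per-ID dictionary the question was trying to build, note that `dict(zip(ara_id_list, ara_go_list))` keeps only the last value for each repeated key, so most pairs are lost before the deduplication loop runs. A minimal sketch that groups GO terms directly while reading the lines might look like this; `group_pairs` and the short `lines` list are hypothetical names and data chosen for illustration.

```python
from collections import OrderedDict


def group_pairs(lines):
    """Map each ID to its list of unique GO terms, in first-seen order.

    group_pairs is a hypothetical helper name used only for this sketch.
    """
    groups = OrderedDict()
    for line in lines:
        if "=" not in line:
            continue  # skip blank or malformed lines
        gene_id, go_id = [part.strip() for part in line.split("=", 1)]
        terms = groups.setdefault(gene_id, [])
        if go_id not in terms:
            terms.append(go_id)
    return groups


# A few lines from the question's input, including one duplicate pair.
lines = [
    "AT1G01010 = 0005634",
    "AT1G01010 = 0006355",
    "AT1G01010 = 0006355",
    "AT1G01020 = 0016125",
]

# Print the pairs back in the original "ID = GO" layout.
for gene_id, terms in group_pairs(lines).items():
    for go_id in terms:
        print("%s = %s" % (gene_id, go_id))
```

Building the lists while reading avoids the intermediate `zip`, so no pair is overwritten before it can be checked for duplicates.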