Removing reversed duplicate lines from a file in Python

Time: 2016-06-14 14:51:57

Tags: python

I have a file like this:

geneA geneB 134
geneC geneF 395
geneH geneD 958
geneF geneC 395
geneB geneA 134
geneD geneH 958

I want to remove the lines that contain the same pair of genes in reversed order, so that I get:

geneA geneB 134
geneC geneF 395
geneH geneD 958    

So far I have this, but when I try to use replace() or an if not statement, I end up with more duplicates. Any ideas on how to change this?

with open(filename, 'r') as handle, open(outfilename, 'a') as w:

    for line in handle:
        element = line.split()
        gene1 = element[0]
        gene2 = element[1]

        for line in handle:
            matchingelement = line.split()
            gene3 = matchingelement[0]
            gene4 = matchingelement[1]

            if gene3 == gene2 and gene4 == gene1:
                """Remove the line"""

1 answer:

Answer 0 (score: 3)

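Convert the genes to a hashable form that can be added to a set, and check against that set as you go. In this example, I sort the two genes so that their order doesn't matter, then join them back into a "normalized" string.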

filename = 'a.txt'
outfilename = 'aout.txt'

seen = set()

# 'w' truncates the output file, so rerunning the script does not append the same lines again
with open(filename, 'r') as handle, open(outfilename, 'w') as w:
    for line in handle:
        element = line.split()
        # a hashable "normalized" view of the genes
        genes = '-'.join(sorted(element[0:2]))
        if genes not in seen:
            seen.add(genes)
            w.write(line)

# show the result; the with block closes the file afterwards
with open(outfilename) as result:
    print(result.read())
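
One caveat with the joined string: if a gene name could itself contain '-', two different pairs might normalize to the same string. A minimal sketch of a variant that uses a sorted tuple as the key instead is below. The function name drop_reversed_pairs and the choice to skip lines with fewer than two fields are my own assumptions, not part of the original question, and the sample data is copied from the question.

```python
def drop_reversed_pairs(lines):
    """Yield each line whose unordered gene pair has not been seen before."""
    seen = set()
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            # assumption: blank or malformed lines are dropped
            continue
        # a sorted tuple is hashable and independent of column order
        key = tuple(sorted(fields[:2]))
        if key not in seen:
            seen.add(key)
            yield line


sample = [
    "geneA geneB 134\n",
    "geneC geneF 395\n",
    "geneH geneD 958\n",
    "geneF geneC 395\n",
    "geneB geneA 134\n",
    "geneD geneH 958\n",
]

result = list(drop_reversed_pairs(sample))
print("".join(result), end="")
```

Because the function accepts any iterable of lines, an open file handle can be passed in directly and the yielded lines written out with w.writelines(...). Only the first occurrence of each pair is kept, in the original file order.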