How can I optimize set matching in Python in this case?

Time: 2013-12-11 15:26:24

Tags: python multiple-columns

I have a text file containing two columns of "scaffolds", like this:

scaffold1|size14662     scaffold1|size14662    
scaffold1|size14662     scaffold2|size14565    
scaffold1|size14662     scaffold111160|size1478
scaffold2|size14565     scaffold2|size14565    
scaffold2|size14565     scaffold1|size14662    
scaffold2|size14565     scaffold239623|size320 
scaffold3|size14436     scaffold3|size14436    
scaffold3|size14436     scaffold5|size13770    
scaffold3|size14436     scaffold5|size13770    
scaffold3|size14436     scaffold149|size9055   
scaffold4|size14291     scaffold4|size14291    
scaffold4|size14291     scaffold32275|size3028 
scaffold4|size14291     scaffold66288|size2175 
scaffold5|size13770     scaffold5|size13770    
scaffold5|size13770     scaffold133|size9198   
scaffold5|size13770     scaffold149|size9055   
scaffold6|size13181     scaffold6|size13181    
scaffold6|size13181     scaffold92|size9644    
scaffold6|size13181     scaffold113496|size1447
scaffold7|size13167     scaffold7|size13167    

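Each scaffold in the right column "matches" (as in "is the same as") the scaffold on the same row in the left column. For example, the right-column entries

[scaffold1|size14662, scaffold2|size14565, scaffold111160|size1478]

are all the same as scaffold1|size14662 in the left column.

From this file I need to produce a list (not a Python list, just a listing) that contains every set of matching scaffolds, like this: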

scaffold1|size14662
scaffold2|size14565
scaffold111160|size1478
scaffold239623|size320
---
scaffold7|size13167
---
scaffold5|size13770
scaffold3|size14436
scaffold149|size9055
scaffold133|size9198
---
scaffold92|size9644
scaffold113496|size1447
scaffold6|size13181
---
scaffold32275|size3028
scaffold66288|size2175
scaffold4|size14291

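I managed to write some code that does this, but it is very slow because it iterates over the same collection again and again. Since the file I am working with has about 2M lines, this is not a workable solution.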

rawscafs = open ("columnfile")

scafs={}
for line in rawscafs:
    cont = 0
    splitvalues=line.split()
    for k,v in scafs.items():
        if splitvalues[1] in v:
            cont = 1
        elif splitvalues[0] in v:
            scafs[k].add(splitvalues[1])
            cont = 1
    if cont == 1:
        cont = 0
        continue       
    if splitvalues[0] in scafs:
        scafs[splitvalues[0]].add(splitvalues[1])
    else:
        scafs[splitvalues[0]] = set()
        scafs[splitvalues[0]].add(splitvalues[1])
rawscafs.close()


for key in scafs:
    for i in (scafs[key]):
        print(i+"\n")
    print("---\n")


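As you can see, this is ugly code, but I was only looking for a quick and dirty solution, which I obviously haven't found yet. Can anyone help me optimize this code (or suggest a simpler solution, since I'm sure there must be one that I just can't figure out)?

1 Answer:

Answer 0 (score: 0)

Thanks to @DSM for the pointer! Using the information provided there, I was able to find a solution to the problem. Here it is: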

#!/usr/bin/python3

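# Group the right-column scaffolds by the left-column scaffold they follow.
# Assumes rows sharing a left-column value are adjacent in the file.
title = None
scaf = set()
scafs = []
with open('columnfile', 'r') as infile:
    for line in infile:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        if fields[0] != title:
            # A new left-column scaffold: store the finished group, start a new one.
            title = fields[0]
            if scaf:
                scafs.append(scaf)
            scaf = set()
        scaf.add(fields[1])

# The loop never stores the final group, so add it here.
if scaf:
    scafs.append(scaf)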

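def consolidate(sets):
    """Merge, in place, all sets that share at least one element."""
    setlist = [s for s in sets if s]
    for i, s1 in enumerate(setlist):
        if s1:
            for s2 in setlist[i+1:]:
                intersection = s1.intersection(s2)
                if intersection:
                    # Fold s1 into s2, empty s1, and keep going with the merged set.
                    s2.update(s1)
                    s1.clear()
                    s1 = s2
    # Sets emptied by a merge are dropped from the result.
    return [s for s in setlist if s]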


for i in consolidate(scafs):
    for a in i:
        print(a)
    print("---")

Yes, I know it still looks ugly, but for now it does what I need. Once I plug it into my program, it will certainly look better.
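For a file of around 2M lines, the pairwise comparison in consolidate may still become the bottleneck, because every group can be intersected with every later group. One possible alternative, not taken from the answer above, is a disjoint-set (union-find) structure: each row links its two scaffolds, and scaffolds that are linked directly or through a chain of rows end up in the same group. The sketch below is a minimal version under those assumptions; the names find, group_scaffolds and read_pairs are hypothetical helpers introduced only for illustration.

```python
from collections import defaultdict


def find(parent, item):
    """Return the representative of item's group, compressing the path."""
    root = item
    while parent[root] != root:
        root = parent[root]
    # Second pass: point every visited name straight at the root.
    while parent[item] != root:
        parent[item], item = root, parent[item]
    return root


def group_scaffolds(pairs):
    """Group names linked directly or indirectly by any (left, right) pair."""
    parent = {}
    for left, right in pairs:
        # Names seen for the first time start as their own group.
        parent.setdefault(left, left)
        parent.setdefault(right, right)
        root_left = find(parent, left)
        root_right = find(parent, right)
        if root_left != root_right:
            parent[root_right] = root_left  # union the two groups
    groups = defaultdict(set)
    for name in parent:
        groups[find(parent, name)].add(name)
    return list(groups.values())


def read_pairs(path):
    """Yield (left, right) from a whitespace-separated two-column file."""
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) >= 2:
                yield fields[0], fields[1]


# A few rows from the question, given inline so the sketch runs without a file.
sample = [
    ("scaffold1|size14662", "scaffold2|size14565"),
    ("scaffold2|size14565", "scaffold239623|size320"),
    ("scaffold7|size13167", "scaffold7|size13167"),
]
for group in group_scaffolds(sample):
    for name in group:
        print(name)
    print("---")
```

To run it on the real data, something like group_scaffolds(read_pairs('columnfile')) should work. Each row is processed once and the lookups are dictionary accesses, so the running time should grow roughly in proportion to the number of rows rather than to the square of the number of groups. Unlike the grouping loop in the answer, this version does not need rows with the same left-column scaffold to be adjacent.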