I have a text file with two columns of "scaffolds", like this:
scaffold1|size14662 scaffold1|size14662
scaffold1|size14662 scaffold2|size14565
scaffold1|size14662 scaffold111160|size1478
scaffold2|size14565 scaffold2|size14565
scaffold2|size14565 scaffold1|size14662
scaffold2|size14565 scaffold239623|size320
scaffold3|size14436 scaffold3|size14436
scaffold3|size14436 scaffold5|size13770
scaffold3|size14436 scaffold5|size13770
scaffold3|size14436 scaffold149|size9055
scaffold4|size14291 scaffold4|size14291
scaffold4|size14291 scaffold32275|size3028
scaffold4|size14291 scaffold66288|size2175
scaffold5|size13770 scaffold5|size13770
scaffold5|size13770 scaffold133|size9198
scaffold5|size13770 scaffold149|size9055
scaffold6|size13181 scaffold6|size13181
scaffold6|size13181 scaffold92|size9644
scaffold6|size13181 scaffold113496|size1447
scaffold7|size13167 scaffold7|size13167
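The scaffolds in the right column "match" (as in "are the same as") the corresponding scaffold in the left column. For example:
[scaffold1|size14662, scaffold2|size14565, scaffold111160|size1478]
in the right column are the same as scaffold1|size14662 in the left column.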
From this file I need to produce a list (not a Python list, just a listing) containing every set of matching scaffolds, like this:
scaffold1|size14662
scaffold2|size14565
scaffold111160|size1478
scaffold239623|size320
---
scaffold7|size13167
---
scaffold5|size13770
scaffold3|size14436
scaffold149|size9055
scaffold133|size9198
---
scaffold92|size9644
scaffold113496|size1447
scaffold6|size13181
---
scaffold32275|size3028
scaffold66288|size2175
scaffold4|size14291
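I managed to write some code that does this, but it is very slow because it iterates over the same collection again and again. The file I am working with has about 2M lines, so this is not a workable solution.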
rawscafs = open ("columnfile")
scafs={}
for line in rawscafs:
    cont = 0
    splitvalues=line.split()
    for k,v in scafs.items():
        if splitvalues[1] in v:
            cont = 1
        elif splitvalues[0] in v:
            scafs[k].add(splitvalues[1])
            cont = 1
    if cont == 1:
        cont = 0
        continue
    if splitvalues[0] in scafs:
        scafs[splitvalues[0]].add(splitvalues[1])
    else:
        scafs[splitvalues[0]] = set()
        scafs[splitvalues[0]].add(splitvalues[1])
rawscafs.close()
for key in scafs:
    for i in (scafs[key]):
        print(i+"\n")
    print("---\n")
rawscafs.close()
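As you can see, this is ugly code, but I was only looking for a quick and dirty solution, and I clearly have not found one yet. Can anyone help me optimize this code, or suggest a simpler solution? I am sure there must be one; I just cannot figure it out.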
Answer 0 (score: 0)
Thanks to @DSM for the pointer! Using the information provided there, I was able to find a solution to my problem. Here it is:
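#!/usr/bin/python3
# Collect, for each scaffold in the left column, the set of scaffolds
# listed next to it in the right column. This assumes rows are grouped
# by the left column, as in the sample above.
title = ""
scaf = set()
scafs = []
with open('columnfile', 'r') as infile:
    for line in infile:
        fields = line.split()
        if fields[0] != title:
            # A new left-column scaffold starts: store the finished set.
            title = fields[0]
            scafs.append(scaf)
            scaf = set()
        scaf.add(fields[1])
# Store the set for the last scaffold in the file.
scafs.append(scaf)
# Drop the empty placeholder set appended on the first iteration.
del scafs[0]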
def consolidate(sets):
    # Merge every pair of sets that share at least one element.
    setlist = [s for s in sets if s]
    for i, s1 in enumerate(setlist):
        if s1:
            for s2 in setlist[i+1:]:
                intersection = s1.intersection(s2)
                if intersection:
                    # Move the members of s1 into s2, then carry on with the merged set.
                    s2.update(s1)
                    s1.clear()
                    s1 = s2
    return [s for s in setlist if s]

for i in consolidate(scafs):
    for a in i:
        print(a)
    print("---")
Yes, I know it still looks ugly, but for now it does what I need. Once I plug it into my program it will certainly look better.
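One caveat: consolidate compares every set with every later set, so on a file of about 2M lines it may still be slow. A possible alternative, which is not what the code above does, is a union-find (disjoint-set) structure that merges the two scaffolds on each line as it is read. The following is only a minimal sketch under that assumption; the names group_scaffolds and read_pairs are made up for illustration.

```python
from collections import defaultdict


def group_scaffolds(pairs):
    """Group names linked, directly or transitively, by any (left, right) pair."""
    parent = {}

    def find(x):
        # Register a name seen for the first time as its own root.
        parent.setdefault(x, x)
        root = x
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every visited name straight at the root.
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    for left, right in pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            # Merge the two groups by attaching one root to the other.
            parent[root_right] = root_left

    groups = defaultdict(set)
    for name in parent:
        groups[find(name)].add(name)
    return list(groups.values())


def read_pairs(path):
    """Yield (left, right) tuples from a whitespace-separated two-column file."""
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) >= 2:
                yield fields[0], fields[1]
```

With the file from the question, something like `for group in group_scaffolds(read_pairs("columnfile")): print("\n".join(group)); print("---")` should print the same kind of listing. Each line is processed once, and the rows do not need to be grouped by the left column.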