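I have BLAST tabular output with millions of hits. Con is my sequence (contig) and P is a protein hit. I want to distinguish the hits that correspond to the 3 cases illustrated below. Each case should be written to its own new file (3 files in total), and a contig that appears in file 1 should not appear in file 2 or 3. How can I do this?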
con1 -----------------------   (contigs with both overlapping and non-overlapping hits)
  p1----      p2 ------     p4---
     p3-----

con2 ---------------------   (only overlapping)
  p1 -----
     p2 -------
          p3 -----

con3 ----------------   (only non-overlapping)
  p1 ----     p2 -----
Overlap or non-overlap can be identified if I know the protein start and end sites: if S1 < E2 < S2 and E1 < S2 < E2 OR S2 - E1 > 0. My input file looks like this:
contig protein start end
con1 P1 481 931
con1 P2 140 602
con1 P3 232 548
con2 P4 335 406
con2 P5 642 732
con2 P6 2282 2433
con2 P7 729 812
con3 P8 17 148
con3 P9 289 45
My script (this only prints the non-overlapping hits):
from itertools import groupby

def nonoverlapping(hits):
    """Returns a list of non-overlapping hits."""
    nonover = []
    overst = False
    for i in range(1, len(hits)):
        (p, c) = hits[i-1], hits[i]
        if c[2] > p[3]:
            if not overst: nonover.append(p)
            nonover.append(c)
            overst = True
    return nonover

fh = open('file.txt')
oh = open('result.txt', 'w')
for qid, grp in groupby(fh, lambda l: l.split()[0]):
    hits = []
    for line in grp:
        hsp = line.split()
        hsp[2], hsp[3] = int(hsp[2]), int(hsp[3])
        hits.append(hsp)
    if len(hits) > 1:
        hits.sort(key=lambda x: x[2])
        for hit in nonoverlapping(hits):
            oh.write('\t'.join([str(f) for f in hit])+'\n')
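Answer 0 (score: 2)
I would do something like this. Define an "overlaps" function for two hits, then test for each contig whether all, some, or none of its hits overlap. Then write each contig to the corresponding file: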
from itertools import groupby

def overlaps(a, b):
    """Return True if hits a and b overlap (index 2 is start, index 3 is end)."""
    # If one hit ends before the other starts, they do not overlap.
    return not (a[3] < b[2] or b[3] < a[2])

def test_overlapping(hits):
    """Return 'All', 'Some' or 'None' for a list of hits sorted by start.

    Each hit is compared with the next one in the sorted list.
    """
    overlapping_count = 0
    for i in range(len(hits) - 1):
        if overlaps(hits[i], hits[i+1]):
            overlapping_count += 1
    if overlapping_count == 0:
        overlapping = 'None'
    elif overlapping_count == len(hits) - 1:
        overlapping = 'All'
    else:
        overlapping = 'Some'
    return overlapping

fh = open('file.txt')
file_all = open('result_all.txt', 'w')
file_some = open('result_some.txt', 'w')
file_none = open('result_none.txt', 'w')

fh.readline()  # skip the header line
# groupby only groups consecutive lines, so the input must be ordered by contig.
for qid, grp in groupby(fh, lambda l: l.split()[0]):
    hits = []
    for line in grp:
        hsp = line.split()
        hsp[2], hsp[3] = int(hsp[2]), int(hsp[3])
        hits.append(hsp)
    # Contigs with a single hit are skipped, as in the question's script.
    if len(hits) > 1:
        hits.sort(key=lambda x: x[2])
        overlapping = test_overlapping(hits)
        out_file = file_none
        if overlapping == 'All':
            out_file = file_all
        elif overlapping == 'Some':
            out_file = file_some
        for h in hits:
            out_file.write('\t'.join([str(v) for v in h]))
            out_file.write('\n')

fh.close()
file_all.close()
file_some.close()
file_none.close()
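
One caveat: test_overlapping only compares neighbouring hits after sorting by start, so a long hit that spans a later, non-adjacent hit can be missed. Below is a minimal sketch of an alternative that tracks the largest end seen so far. The function name classify_hits is hypothetical, and two assumptions in it are mine rather than the question's: a hit counts as overlapping if it overlaps any other hit on the same contig, and a row whose start is greater than its end (such as con3 P9 289 45) is treated as a reversed interval and swapped. If either assumption does not match your data, adjust accordingly.

```python
def classify_hits(hits):
    """Classify one contig's hits as 'All', 'Some' or 'None' overlapping.

    hits is an iterable of (start, end) pairs.
    """
    # Assumption: start > end only reflects orientation, so put each pair in order.
    intervals = sorted((min(s, e), max(s, e)) for s, e in hits)

    cluster_sizes = []  # number of hits in each run of mutually chained overlaps
    max_end = None      # largest end coordinate seen in the current cluster
    for start, end in intervals:
        if max_end is not None and start <= max_end:
            # Overlaps (or touches) something already in the current cluster.
            cluster_sizes[-1] += 1
            max_end = max(max_end, end)
        else:
            # Starts after everything seen so far: open a new cluster.
            cluster_sizes.append(1)
            max_end = end

    # Hits in clusters of size 1 overlap nothing; all other hits overlap something.
    overlapping = sum(size for size in cluster_sizes if size > 1)
    if overlapping == 0:
        return 'None'
    if overlapping == len(intervals):
        return 'All'
    return 'Some'


if __name__ == '__main__':
    # (1, 100) spans (30, 40) although the two are not neighbours once sorted.
    print(classify_hits([(1, 100), (10, 20), (30, 40)]))
    # The con2 rows from the question.
    print(classify_hits([(335, 406), (642, 732), (2282, 2433), (729, 812)]))
```

In the loop above, it could replace the call to test_overlapping with classify_hits([(h[2], h[3]) for h in hits]). Like overlaps, it treats hits that merely touch (end equal to the next start) as overlapping.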