My file looks like this:
1 33725 36725 ENHANCER0002
1 711760 714760 ENHANCER0003
1 724150 727150 ENHANCER0004
1 725455 728455 ENHANCER0005
1 871280 874410 ENHANCER0006
1 874180 877180 ENHANCER0007
1 900540 903540 ENHANCER0008
1 901475 904475 ENHANCER0009
1 910260 913260 ENHANCER00010
1 933355 936355 ENHANCER00011
1 947660 950660 ENHANCER00012
1 1013530 1016530 ENHANCER00013
.
.
.
1 2477030 2480030 ENHANCER00043
1 2478160 2481160 ENHANCER00044
1 2478845 2481845 ENHANCER00045
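The two middle columns are my lower and upper bounds. As in rows 3-4 or rows 5-6, the bounds can overlap. I need to reshape the file so that, when bounds overlap, only the lowest lower bound and the highest upper bound are printed. I am using Python for this, and here is my code: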
def write_line(chr_no,tmp_l,tmp_h,cnt,filename):
    filename.write(str(chr_no)+"\t"+str(tmp_l)+"\t"+str(tmp_h)+"\t"+"ENHANCER000"+str(cnt)+"\n")

inf = open("/home/firat/Desktop/Onder_Lab/Kenan/enhancers_bj.bed","r")
outf = open("/home/firat/Desktop/deneme_v3.bed","w")
cnt = 0
tmp_l=0
tmp_h=0
tmp_list = []
for line in inf:
    cnt += 1
    line = line.split(' ')
    current_low = line[1]
    current_high = line[2]
    previous_low = tmp_l
    previous_high = tmp_h
    if (int(current_low) <= int(previous_high)):
        tmp_list.append(int(current_low))
        tmp_list.append(int(current_high))
        tmp_list.append(int(previous_low))
        tmp_list.append(int(previous_high))
        write_line(line[0],min(tmp_list),max(tmp_list),cnt,outf)
        tmp_l = min(tmp_list)
        tmp_h = max(tmp_list)
        tmp_list = []
    else:
        write_line(line[0], previous_low, previous_high, cnt, outf)
        tmp_l= current_low
        tmp_h= current_high
My solution looks like it should work, but the output is as follows:
1 27460 30460 ENHANCER0002
1 33725 36725 ENHANCER0003
1 711760 714760 ENHANCER0004
1 724150 728455 ENHANCER0005
1 724150 728455 ENHANCER0006
1 871280 877180 ENHANCER0007
1 871280 877180 ENHANCER0008
1 900540 904475 ENHANCER0009
1 900540 904475 ENHANCER00010
1 910260 913260 ENHANCER00011
1 933355 936355 ENHANCER00012
1 947660 950660 ENHANCER00013
1 1013530 1016530 ENHANCER00014
.
.
.
1 2477030 2481160 ENHANCER00044
1 2477030 2481845 ENHANCER00045
1 2477030 2481845 ENHANCER00046
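Notice that rows are printed more than once when bounds overlap. There are also cases, like the one at the very bottom, where 3 rows overlap. The expected output should be: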
1 27460 30460 ENHANCER0002
1 33725 36725 ENHANCER0003
1 711760 714760 ENHANCER0004
1 724150 728455 ENHANCER0005
1 871280 877180 ENHANCER0006
1 900540 904475 ENHANCER0007
1 910260 913260 ENHANCER0008
.
.
.
1 2477030 2481845 ENHANCER00046
What is wrong with my code, and how can I improve it so that it also works when more than 2 rows overlap?
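Answer 0: (score: 0)
Your code seems overly complicated for such a simple task. You don't need four variables (tmp_l, tmp_h, previous_low and previous_high), and you don't need to maintain a list of the current overlapping intervals either. All you need to keep track of is the low and the high of the current group of overlapping intervals.
The actual problem with your code, however, is that it calls write_line on every iteration. What you want is to call write_line only when the current low exceeds the previous high, which means the previous group of overlapping intervals has ended, and once more after the loop finishes.
The following code should work: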
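# Uses inf, outf and write_line as defined in the question.
cnt = 0
previous_chr = None
previous_low = previous_high = 0
for line in inf:
    fields = line.split()
    if len(fields) < 3:
        continue  # skip blank or malformed lines
    current_low = int(fields[1])
    current_high = int(fields[2])
    if fields[0] == previous_chr and current_low <= previous_high:
        # Still overlapping: extend the group's upper bound.
        previous_high = max(previous_high, current_high)
    else:
        # The previous group has ended: write it, then start a new one.
        if previous_high > 0:
            cnt += 1
            write_line(previous_chr, previous_low, previous_high, cnt, outf)
        previous_chr = fields[0]
        previous_low = current_low
        previous_high = current_high
# Write the last group of overlapping intervals.
if previous_high > 0:
    cnt += 1
    write_line(previous_chr, previous_low, previous_high, cnt, outf)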
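The check if previous_high > 0 is needed so that the default values of previous_low and previous_high (0, 0) are not written out. The extra write_line call after the for loop is needed to output the last group of overlapping intervals.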
This code also works when more than 2 intervals overlap.
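To check the 3-row case without touching any files, the same idea can be sketched as a small function that works on tuples in memory. This is only an illustration: merge_intervals is a hypothetical helper name, and it assumes the rows are already sorted by chromosome and then by lower bound, as they are in the sample shown in the question. The rows below are copied from that sample.

```python
def merge_intervals(rows):
    """Merge overlapping (chrom, low, high) rows.

    Assumes rows are sorted by chrom, then by low (hypothetical helper,
    not part of the original code).
    """
    merged = []
    for chrom, low, high in rows:
        if merged and merged[-1][0] == chrom and low <= merged[-1][2]:
            # Overlaps the last group: keep its low, extend its high.
            last = merged[-1]
            merged[-1] = (chrom, last[1], max(last[2], high))
        else:
            # No overlap (or a new chromosome): start a new group.
            merged.append((chrom, low, high))
    return merged


# Rows taken from the sample in the question; the last three all overlap.
rows = [
    ("1", 900540, 903540),
    ("1", 901475, 904475),
    ("1", 910260, 913260),
    ("1", 2477030, 2480030),
    ("1", 2478160, 2481160),
    ("1", 2478845, 2481845),
]
for chrom, low, high in merge_intervals(rows):
    print(chrom, low, high)
```

Because each new row is compared against the running high of the current group rather than against the previous row only, any number of consecutive overlapping rows collapses into a single interval.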