Python - How to get rid of rows

Time: 2017-01-02 13:51:38

Tags: python arrays algorithm

My file looks like this:

1 33725 36725 ENHANCER0002
1 711760 714760 ENHANCER0003
1 724150 727150 ENHANCER0004
1 725455 728455 ENHANCER0005
1 871280 874410 ENHANCER0006
1 874180 877180 ENHANCER0007
1 900540 903540 ENHANCER0008
1 901475 904475 ENHANCER0009
1 910260 913260 ENHANCER00010
1 933355 936355 ENHANCER00011
1 947660 950660 ENHANCER00012
1 1013530 1016530 ENHANCER00013
. . .
1 2477030 2480030 ENHANCER00043
1 2478160 2481160 ENHANCER00044
1 2478845 2481845 ENHANCER00045

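The two middle columns are my lower and upper bounds. In some places, such as rows 3-4 or rows 5-6, the bounds overlap. I need to reshape the file so that, when bounds overlap, only the lowest lower bound and the highest upper bound are printed. I am using Python to solve this, and here is my code: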

def write_line(chr_no,tmp_l,tmp_h,cnt,filename):
    filename.write(str(chr_no)+"\t"+str(tmp_l)+"\t"+str(tmp_h)+"\t"+"ENHANCER000"+str(cnt)+"\n")


inf = open("/home/firat/Desktop/Onder_Lab/Kenan/enhancers_bj.bed","r")
outf = open("/home/firat/Desktop/deneme_v3.bed","w")

cnt = 0
tmp_l=0
tmp_h=0

tmp_list = []

for line in inf:
    cnt += 1
    line = line.split(' ')
    current_low = line[1]
    current_high = line[2]
    previous_low = tmp_l
    previous_high = tmp_h
    if (int(current_low) <= int(previous_high)):
        tmp_list.append(int(current_low))
        tmp_list.append(int(current_high))
        tmp_list.append(int(previous_low))
        tmp_list.append(int(previous_high))
        write_line(line[0],min(tmp_list),max(tmp_list),cnt,outf)
        tmp_l = min(tmp_list)
        tmp_h = max(tmp_list)
        tmp_list = []
    else:
        write_line(line[0], previous_low, previous_high, cnt, outf)
        tmp_l= current_low
        tmp_h= current_high

My solution looks like it should work, but the output is as follows:

1 27460 30460 ENHANCER0002
1 33725 36725 ENHANCER0003
1 711760 714760 ENHANCER0004
1 724150 728455 ENHANCER0005
1 724150 728455 ENHANCER0006
1 871280 877180 ENHANCER0007
1 871280 877180 ENHANCER0008
1 900540 904475 ENHANCER0009
1 900540 904475 ENHANCER00010
1 910260 913260 ENHANCER00011
1 933355 936355 ENHANCER00012
1 947660 950660 ENHANCER00013
1 1013530 1016530 ENHANCER00014
. . .
1 2477030 2481160 ENHANCER00044
1 2477030 2481845 ENHANCER00045
1 2477030 2481845 ENHANCER00046

Note that a merged row is printed more than once whenever bounds overlap. There are also cases, such as the one at the very bottom, where 3 rows overlap. The expected output should be:

1 27460 30460 ENHANCER0002
1 33725 36725 ENHANCER0003
1 711760 714760 ENHANCER0004
1 724150 728455 ENHANCER0005
1 871280 877180 ENHANCER0006
1 900540 904475 ENHANCER0007
1 910260 913260 ENHANCER0008
. . .
1 2477030 2481845 ENHANCER00046

What is wrong with my code, and how can I improve it so that it also works when more than 2 rows overlap?

1 Answer:

Answer 0: (score: 0)

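Your code seems overly complicated for such a simple task. You do not need four variables (tmp_l, tmp_h, previous_low and previous_high), and you do not need to maintain a list for the current overlapping intervals. All you need to keep is the low and the high of the current group of overlapping intervals.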

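The actual problem with your code, however, is that write_line is called on every iteration. You should call write_line only in two places: when the current low exceeds the previous high, which means the previous group of overlapping intervals has ended, and once more after the loop finishes.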

The following code should work:

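# Defaults meaning "no group seen yet"; cnt, inf, outf and write_line come from the question
previous_low = 0
previous_high = 0

for line in inf:  # a file object is iterated line by line; it has no splitlines()
    cnt += 1
    line = line.split()  # split on any whitespace (spaces or tabs)
    current_low = int(line[1])
    current_high = int(line[2])
    if current_low <= previous_high:
        # Overlap: extend the current group; max() also covers a nested interval
        previous_high = max(previous_high, current_high)
    else:
        # Gap: the previous group has ended, so write it out
        if previous_high > 0:
            write_line(line[0], previous_low, previous_high, cnt, outf)
        previous_low = current_low
        previous_high = current_high

# Write out the last group of overlapping intervals
if previous_high > 0:
    write_line(line[0], previous_low, previous_high, cnt, outf)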

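The check if previous_high > 0 is needed so that the default values of previous_low and previous_high (0, 0) are never written out. The extra write_line call after the for loop is needed to output the last group of overlapping intervals.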

This code also works when more than 2 intervals overlap.
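
For anyone who wants to try the idea without the input file, here is a minimal self-contained sketch of the same approach working on an in-memory list. The function name merge_intervals, the tuple layout (chrom, low, high) and numbering the output from 1 are my own assumptions, not part of the question. It assumes the rows are already sorted by chromosome and then by lower bound, as they appear to be in the sample, and it also starts a new group whenever the chromosome changes.

```python
def merge_intervals(rows):
    """Merge overlapping (chrom, low, high) rows.

    Assumes rows are sorted by chrom, then by low (hypothetical helper).
    """
    merged = []
    for chrom, low, high in rows:
        if merged and merged[-1][0] == chrom and low <= merged[-1][2]:
            # Overlaps the last group: keep its low, extend its high if needed.
            last_chrom, last_low, last_high = merged[-1]
            merged[-1] = (last_chrom, last_low, max(last_high, high))
        else:
            # No overlap (or a new chromosome): start a new group.
            merged.append((chrom, low, high))
    return merged


# A few rows taken from the sample in the question.
rows = [
    ("1", 711760, 714760),
    ("1", 724150, 727150),
    ("1", 725455, 728455),
    ("1", 2477030, 2480030),
    ("1", 2478160, 2481160),
    ("1", 2478845, 2481845),
]

# Number the merged rows consecutively (starting at 1 is an assumption).
for number, (chrom, low, high) in enumerate(merge_intervals(rows), start=1):
    print(chrom, low, high, "ENHANCER000" + str(number), sep="\t")
```

Because each group is stored only once and extended in place, nothing is printed until all merging is done, so the duplicate rows from the question cannot appear, and a run of 3 or more overlapping rows collapses into a single row.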