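How to fill in missing sequence rows in a TSV file

Time: 2017-01-17 13:47:55

Tags: python loops while-loop infinite

I'm still a beginner, so apologies if this question has an obvious answer, and sorry for the messy code. I have files with a few dozen lines each. I'm using a kind of sliding-window technique to move over each file, so I need to make sure every window is present. However, some of my input files are missing certain lines, so I tried to write Python code that adds those lines, with the information I want, to make the files complete. This is what the code looks like: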

#!/usr/bin/env python

outfile = open ("missing_test.txt", "w")

with open("add_missing.txt", "r") as file:
    last_line = 0   #This is where it starts for bin 1
    lines = []
    header_line = next(file)
    outfile.write(header_line)
    CHROM = 'BABA_1'
    for line in file:     #go through every line to check its existence and rewrite to new file
        nums = line.split("\t")
        num1 = nums[0]        #no integer because this is a string: name individual
        num2 = int(nums[1])   #integer for window
        num3 = int(nums[2])   #integer for coverage (here always 10000 to meet threshold)
        num4 = int(nums[3])   #integer for SNP count   
        if num1 == CHROM:     #
            while num2 != last_line + 10000:
                #A line is missing, so a new line is added with 0 SNPs:
                NUM2 = last_line + 10000   # New window, the one that was missing
                NUM4 = 0   #0 SNPs found
                #lines.append((num1, NUM2, num3, NUM4))
                OUTLINE = "%s\t%s\t%s\t%s" % (num1, NUM2, num3, NUM4) #write new line to outfile       
                outfile.write(OUTLINE + "\n")
                last_line += 10000
            lines.append((num1,num2,num3,num4))
            last_line += 10000    #also add 10000 here otherwise the while loop makes no sense
            outline = "%s\t%s\t%s\t%s" % (num1, num2, num3, num4)
            outfile.write(outline + "\n")   #write all existing lines to outfile

        else:
            CHROM = num1
            last_line = 0

outfile.close()        

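This works perfectly fine as long as the first window of the first "CHROM" equals 0, but that is not always the case. When it isn't, the loop becomes infinite. Here is an example of the input and the DESIRED output:

Input: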

indiv   window  coverage    SNP
BABA_1  20000   10000   7
BABA_1  30000   10000   1
BABA_1  50000   10000   2
BABA_1  60000   10000   3
BABA_1  80000   10000   1
BABA_10 20000   10000   1
BABA_10 30000   10000   16
BABA_10 80000   10000   9

Desired output:

indiv   window  coverage    SNP
BABA_1  10000   10000   0
BABA_1  20000   10000   7
BABA_1  30000   10000   1
BABA_1  40000   10000   0
BABA_1  50000   10000   2
BABA_1  60000   10000   3
BABA_1  70000   10000   0
BABA_1  80000   10000   1
BABA_10 10000   10000   0
BABA_10 20000   10000   1
BABA_10 30000   10000   16
BABA_10 40000   10000   0
BABA_10 50000   10000   0
BABA_10 60000   10000   0
BABA_10 70000   10000   0
BABA_10 80000   10000   9

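I've been struggling to find an answer. My loop keeps running without ever terminating, but I honestly can't see my mistake. Does anyone have a hint on how I can solve this?

Any help is much appreciated, thanks in advance!

2 answers:

Answer 0: (score: 2)

Try something along these lines: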

#!/usr/bin/python

outfile = open ("missing_test.txt", "w")

def write_line(indiv, window, coverage, snp):
    outline = "%s\t%s\t%s\t%s\n" % (indiv, window, coverage, snp)
    outfile.write(outline)

with open("add_missing.txt", "r") as file:
    lines = file.readlines()
    write_line(*lines.pop(0).rstrip().split("\t"))
    first_line = lines[0].split("\t")
    last_indiv = first_line[0]
    last_window = int(first_line[1])

    for line in lines:
        indiv, window, coverage, snp = line.split("\t")
        window = int(window)
        coverage = int(coverage)
        snp = int(snp)

        if indiv == last_indiv:
            # If the current window is higher than expected,
            # insert a line with the missing window.
            # Repeat until we get to the expected window.
            while window > last_window + 10000:
                write_line(indiv, last_window + 10000, coverage, 0)
                last_window += 10000
            last_window = window
        else:
            last_indiv = indiv
            last_window = window
        write_line(indiv, window, coverage, snp)

outfile.close()   # flush and close the output file

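What it doesn't include is any expectation that a particular window number is the first one for a given indiv, because you didn't define that behavior and your comments about it are rather confusing.

Contents of missing_test.txt after running this script: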

indiv   window  coverage    SNP
BABA_1  20000   10000   7
BABA_1  30000   10000   1
BABA_1  40000   10000   0
BABA_1  50000   10000   2
BABA_1  60000   10000   3
BABA_1  70000   10000   0
BABA_1  80000   10000   1
BABA_10 20000   10000   1
BABA_10 30000   10000   16
BABA_10 40000   10000   0
BABA_10 50000   10000   0
BABA_10 60000   10000   0
BABA_10 70000   10000   0
BABA_10 80000   10000   9
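
If every indiv is meant to start at window 10000, as the desired output in the question suggests, the same idea can be extended to fill in leading gaps as well. The following is only a sketch under that assumption: the function name `fill_missing` and the `STEP` constant are made up for illustration, and rows are assumed to be sorted by window within each indiv. It does not add windows after the last existing row of an indiv.

```python
STEP = 10000  # window size, taken from the question's data


def fill_missing(lines, step=STEP):
    """Yield tab-separated rows, inserting zero-SNP rows for absent windows.

    Assumption (not stated in the question): each indiv starts at window
    `step`, and rows are sorted by window within each indiv.
    """
    rows = iter(lines)
    yield next(rows).rstrip("\n")           # header passes through unchanged
    last_indiv, expected = None, step
    for line in rows:
        if not line.strip():
            continue                         # ignore blank lines
        indiv, window, coverage, snp = line.rstrip("\n").split("\t")
        window = int(window)
        if indiv != last_indiv:              # new indiv: restart at the first window
            last_indiv, expected = indiv, step
        # `<` instead of `!=`, so an unexpected window value cannot loop forever
        while expected < window:
            yield "\t".join([indiv, str(expected), coverage, "0"])
            expected += step
        yield "\t".join([indiv, str(window), coverage, snp])
        expected = window + step


sample = [
    "indiv\twindow\tcoverage\tSNP\n",
    "BABA_1\t20000\t10000\t7\n",
    "BABA_1\t40000\t10000\t2\n",
    "BABA_10\t20000\t10000\t1\n",
]
for row in fill_missing(sample):
    print(row)
```

Because the function accepts any iterable of lines, an open file object can be passed in directly and the yielded rows written out with a trailing newline.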

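Answer 1: (score: 1)

You could use the following approach, which first builds a list of empty entries and then assigns any entries that are present into it before writing them to the output as rows: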

import csv
import itertools

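# Python 3: open both files in text mode with newline='', as the csv module expects
with open('add_missing.txt', 'r', newline='') as f_input, open('missing_test.txt', 'w', newline='') as f_output:
    csv_input = csv.reader(f_input, delimiter='\t', skipinitialspace=True)
    csv_output = csv.writer(f_output, delimiter='\t')
    csv_output.writerow(next(csv_input))   # copy the header row

    for k, g in itertools.groupby(csv_input, lambda x: x[0]):
        # Placeholder rows with 0 SNPs for windows 10000 through 80000
        empty = [[k, x * 10000, 10000, 0] for x in range(1, 9)]
        for row in g:
            # Overwrite the placeholder with the row that actually exists;
            # // keeps the index an integer
            empty[int(row[1]) // 10000 - 1] = row

        csv_output.writerows(empty)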

This gives you:

indiv   window  coverage    SNP
BABA_1  10000   10000   0
BABA_1  20000   10000   7
BABA_1  30000   10000   1
BABA_1  40000   10000   0
BABA_1  50000   10000   2
BABA_1  60000   10000   3
BABA_1  70000   10000   0
BABA_1  80000   10000   1
BABA_10 10000   10000   0
BABA_10 20000   10000   1
BABA_10 30000   10000   16
BABA_10 40000   10000   0
BABA_10 50000   10000   0
BABA_10 60000   10000   0
BABA_10 70000   10000   0
BABA_10 80000   10000   9
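
The `range(1, 9)` above hard-codes eight windows, which happens to match the sample data. If the number of windows differs between files, one option is to derive the upper bound from the largest window seen. The sketch below assumes the whole file fits in memory, that placeholder rows should use the window size as their coverage (as in the sample), and that rows are grouped by indiv. `fill_groups` and `STEP` are names invented for this example.

```python
import csv
import io
import itertools

STEP = 10000  # window size used in the question's data


def fill_groups(rows, step=STEP):
    """Return rows covering every window from `step` up to the overall maximum.

    `rows` is a list of [indiv, window, coverage, SNP] lists of strings,
    already grouped by indiv as in the question's file.
    """
    last_slot = max(int(row[1]) for row in rows) // step
    filled = []
    for indiv, group in itertools.groupby(rows, key=lambda row: row[0]):
        # One zero-SNP placeholder per window; coverage is assumed equal to step
        slots = [[indiv, str(n * step), str(step), "0"]
                 for n in range(1, last_slot + 1)]
        for row in group:
            slots[int(row[1]) // step - 1] = row   # keep the rows that exist
        filled.extend(slots)
    return filled


sample = "indiv\twindow\tcoverage\tSNP\nBABA_1\t20000\t10000\t7\nBABA_10\t30000\t10000\t16\n"
reader = csv.reader(io.StringIO(sample), delimiter="\t")
header = next(reader)
for row in [header] + fill_groups(list(reader)):
    print("\t".join(row))
```

Unlike the streaming loop in the other answer, this pads every indiv out to the same final window, which may or may not be what you want.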