Question

I'm new to Python, so please bear with me.

I can't get this little script to work properly:

genome = open('refT.txt','r')

The data file, a reference genome with a bunch (2 million) of contigs:

Contig_01
TGCAGGTAAAAAACTGTCACCTGCTGGT
Contig_02
TGCAGGTCTTCCCACTTTATGATCCCTTA
Contig_03
TGCAGTGTGTCACTGGCCAAGCCCAGCGC
Contig_04
TGCAGTGAGCAGACCCCAAAGGGAACCAT
Contig_05
TGCAGTAAGGGTAAGATTTGCTTGACCTA

The second file I open:

cont_list = open('dataT.txt','r')

The list of contigs that I want to extract from the dataset listed above:

Contig_01
Contig_02
Contig_03
Contig_05

My hopeless script:

for line in cont_list:
    if genome.readline() not in line:
        continue
    else:
        a=genome.readline()
        s=line+a    
        data_out = open ('output.txt','a')
        data_out.write("%s" % s)
        data_out.close()

input('Press ENTER to exit')

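The script successfully writes the first three contigs to the output file, but for some reason it doesn't seem able to skip "Contig_04", which is not in the list, and move on to "Contig_05".

I might look like a lazy bastard for posting this, but I have spent the whole afternoon on this tiny bit of code -_-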

Answer 1

I would first try to write a generator that yields a tuple of (contig name, sequence):

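def pair(file_obj):
    for line in file_obj:
        # the default '' avoids a StopIteration inside the generator
        # (a RuntimeError on Python 3.7+) if the file has an odd number of lines
        yield line, next(file_obj, '')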

Now I would use it to get the desired elements:

wanted = {'Contig_01', 'Contig_02', 'Contig_03', 'Contig_05'}
with open('filename') as fin:
    for p in pair(fin):
        name = p[0].strip()  # drop the trailing newline before comparing
        if name in wanted:
            # write to output file, store in a list, or dict, ...
            wanted.remove(name)
            if not wanted:
                break  # every wanted contig has been found
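
Putting the generator and the loop together, a minimal end-to-end sketch might look like the following. The helper name extract_contigs is hypothetical, and io.StringIO objects stand in for refT.txt and output.txt so the example runs without any files on disk.

```python
import io

def pair(file_obj):
    """Yield (name, sequence) tuples from a file with alternating lines."""
    for line in file_obj:
        # next(..., '') avoids an error if the last name has no sequence line
        yield line.strip(), next(file_obj, '').strip()

def extract_contigs(genome_file, wanted, out_file):
    """Write each wanted contig and its sequence; return the names not found."""
    remaining = set(wanted)
    for name, seq in pair(genome_file):
        if name in remaining:
            out_file.write(name + '\n' + seq + '\n')
            remaining.discard(name)
            if not remaining:
                break  # stop early once everything has been found
    return remaining

# Tiny in-memory stand-ins for the genome file and the output file
genome = io.StringIO(
    "Contig_01\nTGCAGGTAAAAAACTGTCACCTGCTGGT\n"
    "Contig_04\nTGCAGTGAGCAGACCCCAAAGGGAACCAT\n"
    "Contig_05\nTGCAGTAAGGGTAAGATTTGCTTGACCTA\n"
)
out = io.StringIO()
missing = extract_contigs(genome, {'Contig_01', 'Contig_05', 'Contig_99'}, out)
print(out.getvalue())
print(missing)
```

Returning the names that were never found makes it easy to spot typos in the contig list.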

Answer 2

I would recommend a few things:

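Try using with open(filename, 'r') as f instead of f = open(...) / f.close(). The with statement closes the file for you. It also encourages you to handle all of your file IO in one place.
Try reading all the contigs you want into a list or other structure. Having many files open at once is a pain. Read all the lines in once and store them.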

Here is some sample code that might do what you are looking for:

from itertools import zip_longest  # named izip_longest on Python 2

# Read in contigs from file and store in a set for fast membership tests
contigs = set()
with open('dataT.txt', 'r') as contigfile:
    for line in contigfile:
        contigs.add(line.rstrip()) #rstrip() removes '\n' from EOL

# Read through genome file, open up an output file
with open('refT.txt', 'r') as genomefile, open('out.txt', 'w') as outfile:
    # Nifty way to sort through fasta files 2 lines at a time
    # fillvalue='' keeps seq a string if the file has an odd number of lines
    for name, seq in zip_longest(*[genomefile]*2, fillvalue=''):
        # compare the contig name to your set of contigs
        if name.rstrip() in contigs:
            outfile.write(name) #optional. remove if you only want the seq
            outfile.write(seq)
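
The "two lines at a time" trick works because the list holds the same iterator twice, so every step pulls two consecutive lines from it. Here is a small sketch of just that idiom, assuming Python 3 (itertools.zip_longest) and using a list iterator with made-up short sequences in place of the genome file:

```python
from itertools import zip_longest

lines = iter(["Contig_01\n", "TGCAGG\n", "Contig_02\n", "TGCAGT\n", "Contig_03\n"])

# [lines] * 2 is a list containing the SAME iterator twice, so each step
# of zip_longest consumes two consecutive items: a name and its sequence.
pairs = list(zip_longest(*[lines] * 2, fillvalue=""))
print(pairs)
```

The last name has no sequence line, so it is paired with the fill value rather than being dropped.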

Answer 3

Here is a fairly compact way to get the sequences you want.

def get_sequences(data_file, valid_contigs):
    sequences = []

    with open(data_file) as cont_list:
        for line in cont_list:
            if line.startswith(valid_contigs):
                sequence = next(cont_list).strip()  # file.next() is Python 2 only
                sequences.append(sequence)

    return sequences

if __name__ == '__main__':
    valid_contigs = ('Contig_01', 'Contig_02', 'Contig_03', 'Contig_05')
    sequences = get_sequences('refT.txt', valid_contigs)  # the genome file from the question
    print(sequences)

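This takes advantage of the ability of startswith() to accept a tuple as its argument and check for a match against any of its elements. If a line matches one of the contigs you want, the code grabs the next line, strips the unwanted whitespace, and appends it to sequences. From there, writing the collected sequences to an output file is straightforward.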

Sample output:

['TGCAGGTAAAAAACTGTCACCTGCTGGT',
 'TGCAGGTCTTCCCACTTTATGATCCCTTA',
 'TGCAGTGTGTCACTGGCCAAGCCCAGCGC',
 'TGCAGTAAGGGTAAGATTTGCTTGACCTA']
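
One caveat: startswith() matches prefixes, so 'Contig_01' would also match a name such as 'Contig_010' in a larger file. A hedged variant that compares whole names against a set and writes the sequences out directly might look like this. The function name write_sequences, the short made-up sequences, and the in-memory files are illustrative assumptions, not part of the code above.

```python
import io

def write_sequences(genome_file, valid_contigs, out_file):
    """Copy the sequence line that follows each exactly matching contig name."""
    valid = set(valid_contigs)
    count = 0
    for line in genome_file:
        if line.strip() in valid:  # exact match, not a prefix match
            sequence = next(genome_file, '').strip()
            out_file.write(sequence + '\n')
            count += 1
    return count

# In-memory stand-ins for the genome file and the output file
genome = io.StringIO(
    "Contig_01\nAAAA\n"
    "Contig_010\nCCCC\n"
    "Contig_02\nGGGG\n"
)
out = io.StringIO()
written = write_sequences(genome, ('Contig_01', 'Contig_02'), out)
print(written, out.getvalue().split())
```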
