Filtering sequences

Time: 2014-01-30 07:56:33

Tags: python biopython

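I'm struggling a bit with Biopython. I'm trying to filter a set of protein sequences by three criteria: 1) the sequence contains a start codon, represented by M in my protein.fasta file; 2) the sequence contains a stop codon, represented by *; 3) the length between M and * is at least 90% of the length I expect, which is listed in a separate file.

Here is what I've tried so far. The condition I need to define is just a mess in my head, so I would really appreciate some help!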

from Bio import SeqIO

source = 'protein.fasta'
outfile = 'filtered.fa'
sub1 ='M'
sub2 = '*'
length = 'protein_length.txt'

def seq_check(seq, sub1, sub2, length):
    # basically a function to check whether seq contains both M and *, and is of the expected length
    return

seqs = SeqIO.parse(source, 'fasta')
filtered = (seq for seq in seqs if seq_check(seq.seq, sub1, sub2, length))
SeqIO.write(filtered, outfile, 'fasta')


Protein datafile:
>comp12_c0_seq1:217-297
SR*THDYAALLTSHRSLDLVYVYNVV
>comp15_c0_seq1:3-197
*LCI*SCIVRVWLRYPSP*LANYFPQM*RLSAIRLF*ERLIYGPFLC*NYF*S*PKIAVHTYRS

Length datafile:
comp12_c0_seq1   50
comp15_c0_seq1   80

Thanks for your help, Claire

1 answer:

Answer 0 (score: 1)

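The code below reads the expected lengths into a dict keyed by sequence ID, then keeps every record in which some stretch from M to * is at least 90% of its expected length. If you can be sure that the protein file and the length file list their entries in the same order, you may want to modify the code so that it does not use a dict, which saves memory on large data sets. For example, write a generator function that yields the second column converted to int, then iterate over it together with SeqIO.parse() using itertools.izip() (Python 2; in Python 3 the built-in zip() is already lazy).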

from Bio import SeqIO


def read_lengths(path):
    """Read the length file into a dict mapping sequence ID to expected length."""
    lengths = {}
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            seq_id, length = line.split()
            lengths[seq_id] = int(length)
    return lengths


def enclosed_substrings(s, start, stop):
    """Find all substrings starting with `start` and ending with `stop`."""
    startpos = 0
    stoppos = 0
    while True:
        startpos = s.find(start, startpos)
        if startpos < 0:
            break
        stoppos = s.find(stop, startpos + 1)
        if stoppos < 0:
            break
        yield s[startpos:stoppos + 1]
        startpos += 1


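def seq_check(record, expected_lens, len_factor=0.9, start='M', stop='*'):
    """Return True if some start..stop stretch reaches len_factor of the expected length."""
    # FASTA IDs look like "comp12_c0_seq1:217-297", but the length file
    # only uses the part before the colon, so strip the coordinates.
    seq_id = record.id.split(':')[0]
    min_len = expected_lens[seq_id] * len_factor
    # Each substring includes both the start and the stop character.
    for sub in enclosed_substrings(record.seq, start, stop):
        if len(sub) >= min_len:
            return True
    return False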


source_file = 'protein.fasta'
out_file = 'filtered.fa'
length_file = 'protein_length.txt'

expected_lengths = read_lengths(length_file)
seqs = SeqIO.parse(source_file, 'fasta')
filtered = (seq for seq in seqs if seq_check(seq, expected_lengths))
SeqIO.write(filtered, out_file, 'fasta')
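
The dict-free variant suggested at the start of this answer could look roughly like the sketch below. It assumes Python 3 (built-in zip() instead of itertools.izip()) and that both files really do list the same sequences in the same order. The names iter_lengths, has_long_orf, filter_paired and filter_fasta are made up for this illustration and are not part of Biopython. Stripping the ":217-297" suffix from the FASTA ID is also an assumption based on the sample data in the question.

```python
def iter_lengths(lines):
    """Yield (sequence ID, expected length) pairs from length-file lines."""
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        yield fields[0], int(fields[1])


def has_long_orf(seq, min_len, start='M', stop='*'):
    """True if some start..stop stretch (both ends included) has length >= min_len."""
    text = str(seq)  # works for plain strings and for Biopython Seq objects
    startpos = text.find(start)
    while startpos >= 0:
        stoppos = text.find(stop, startpos + 1)
        if stoppos < 0:
            return False  # no stop after this start, so none after later starts either
        if stoppos - startpos + 1 >= min_len:
            return True
        startpos = text.find(start, startpos + 1)
    return False


def filter_paired(records, lengths, len_factor=0.9):
    """Lazily yield records that pass, pairing the i-th record with the i-th length.

    Note that zip() stops silently at the end of the shorter input.
    """
    for record, (seq_id, expected) in zip(records, lengths):
        # Assumption: the FASTA ID is the length-file ID plus a ":start-end" suffix.
        if record.id.split(':')[0] != seq_id:
            raise ValueError(f"files out of step: {record.id!r} vs {seq_id!r}")
        if has_long_orf(record.seq, expected * len_factor):
            yield record


def filter_fasta(source, length_path, out_path):
    """Hypothetical wrapper that wires the helpers above to Biopython."""
    from Bio import SeqIO  # imported here so the helpers also work without Biopython
    with open(length_path) as fh:
        kept = filter_paired(SeqIO.parse(source, 'fasta'), iter_lengths(fh))
        # Write inside the with-block: the generators read the files lazily.
        return SeqIO.write(kept, out_path, 'fasta')
```

Calling filter_fasta('protein.fasta', 'protein_length.txt', 'filtered.fa') would then replace the last block of the script above. The ID comparison in filter_paired is a cheap safeguard: if the two files ever drift out of order, the run fails loudly instead of checking sequences against the wrong lengths. With the two sample records from the question, neither passes: the first has no M at all, and the only M in the second is immediately followed by *.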