Parsing FASTA sequences into a dictionary

Time: 2014-03-27 20:44:23

Tags: python dictionary fasta

I need the simplest solution for converting a fasta.txt file containing multiple nucleotide sequences, such as

>seq1
TAGATTCTGAGTTATCTCTTGCATTAGCAGGTCATCCTGGTCAAACCGCTACTGTTCCGG
CTTTCTGATAATTGATAGCATACGCTGCGAACCCACGGAAGGGGGTCGAGGACAGTGGTG
>seq2
TCCCTCTAGAGGCTCTTTACCGTGATGCTACATCTTACAGGTATTTCTGAGGCTCTTTCA
AACAGGTGCGCGTGAACAACAACCCACGGCAAACGAGTACAGTGTGTACGCCTGAGAGTA
>seq3
GGTTCCGCTCTAAGCCTCTAACTCCCGCACAGGGAAGAGATGTCGATTAACTTGCGCCCA
TAGAGCTCTGCGCGTGCGTCGAAGGCTCTTTTCGCGATATCTGTGTGGTCTCACTTTGGT

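into a dictionary of (name, value) pairs, where each name is a > header and each value is the corresponding sequence.

Below is my failed attempt using two lists (it does not work for long sequences that span more than one line):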

f = open('input2.txt', 'r')
list={}
names=[]
seq=[]
for line in f:
 if line.startswith('>'):
  names.append(line[1:-1])
 elif line.startswith('A') or line.startswith('C') or line.startswith('G') or line.startswith('T'):
  seq.append(line)

list = dict(zip(names, seq))

I would be grateful if you could show me how to fix this, and how to implement it as a separate function.

Thanks for your help,

Gleb

2 answers:

Answer 0: (score: 3)

It is better to use the biopython library:

from Bio import SeqIO

# Values are SeqRecord objects; use str(record.seq) to get the plain sequence string.
with open("input.fasta") as input_file:
    my_dict = SeqIO.to_dict(SeqIO.parse(input_file, "fasta"))

Answer 1: (score: 2)

A simple fix for your code:

from collections import defaultdict  # missing keys start out as an empty string

sequences = defaultdict(str)  # renamed from `list`, which shadows the built-in
name = ''
with open('input2.txt', 'r') as f:
    for line in f:
        # A line starting with '>' is the name of the sequence that follows.
        if line.startswith('>'):
            name = line[1:].strip()
            continue  # skip to the next line
        # Reached only for sequence lines: append them to the current record.
        sequences[name] += line.strip()
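
To make the failure mode concrete, here is a small sketch that uses short made-up sequences held in memory rather than the asker's file. The helper names `zip_attempt` and `accumulate` are hypothetical. The two-list approach stores every sequence line as a separate entry, so `zip` pairs headers with the wrong lines, while accumulating per header keeps each sequence whole.

```python
from collections import defaultdict

# Made-up two-line records, for illustration only.
FASTA_LINES = [
    ">seq1\n", "AAAA\n", "CCCC\n",
    ">seq2\n", "GGGG\n", "TTTT\n",
]


def zip_attempt(lines):
    """Mimics the two-list approach from the question."""
    names, seqs = [], []
    for line in lines:
        if line.startswith(">"):
            names.append(line[1:].strip())
        else:
            seqs.append(line.strip())
    # zip pairs items by position only and stops at the shorter list.
    return dict(zip(names, seqs))


def accumulate(lines):
    """Appends every sequence line to the most recent header."""
    sequences = defaultdict(str)
    name = ""
    for line in lines:
        if line.startswith(">"):
            name = line[1:].strip()
            continue
        sequences[name] += line.strip()
    return dict(sequences)


print(zip_attempt(FASTA_LINES))  # {'seq1': 'AAAA', 'seq2': 'CCCC'}
print(accumulate(FASTA_LINES))   # {'seq1': 'AAAACCCC', 'seq2': 'GGGGTTTT'}
```

With two lines per record, the second header ends up paired with the second half of the first sequence, which is the symptom described in the question.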

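Update

Since I received a notification that this old answer was upvoted, I decided to present what I now consider the proper solution, using Python 3.7. Converting it to Python 2.7 only requires removing the typing import line and the function annotations: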

from collections import OrderedDict
from typing import Dict

NAME_SYMBOL = '>'


def parse_sequences(filename: str,
                    ordered: bool=False) -> Dict[str, str]:
    """
    Parses a text file of genome sequences into a dictionary.
    Arguments:
      filename: str - The name of the file containing the genome info.
      ordered: bool - Set this to True if you want the result to be ordered.
    """
    result = OrderedDict() if ordered else {}

    last_name = None
    with open(filename) as sequences:
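        for line in sequences:
            line = line.strip()  # also safe when the last line has no newline
            if not line:
                continue  # skip blank lines
            if line.startswith(NAME_SYMBOL):
                last_name = line[1:]
                result[last_name] = []
            else:
                result[last_name].append(line)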

    for name in result:
        result[name] = ''.join(result[name])

    return result

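Now, I realize the OP asked for the "simplest solution"; however, since they are working with genome data, it seems fair to assume that each sequence could be very large. In that case it makes sense to optimize a little by collecting the sequence lines into a list and then calling the str.join method on those lists at the end to produce the final result.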
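
The list-then-join idea can be tried without a file on disk. The following is a minimal sketch, assuming the input is any iterable of lines; `parse_fasta_lines` and the sample data are hypothetical names introduced here for illustration, not part of the answer above.

```python
import io


def parse_fasta_lines(lines):
    """Collects sequence lines per header, then joins each list once."""
    chunks = {}      # header -> list of sequence fragments
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue                   # ignore blank lines
        if line.startswith(">"):
            current = line[1:]
            chunks[current] = []
        elif current is not None:      # skip data that precedes any header
            chunks[current].append(line)
    # One join per record instead of repeated string concatenation.
    return {name: "".join(parts) for name, parts in chunks.items()}


sample = io.StringIO(">seq1\nTAGA\nTTCT\n>seq2\nGAGT\n")
result = parse_fasta_lines(sample)
print(result)  # {'seq1': 'TAGATTCT', 'seq2': 'GAGT'}
```

Passing an open file object works the same way, since iterating over a file yields its lines one at a time.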