Populating a dictionary with multiple lines as a single string

Time: 2019-05-24 11:47:41

Tags: python python-3.x dictionary bioinformatics biopython

I have a multi-line file in FASTA format that I want to break into parts, and then populate a dictionary with those pieces.

>piece_1 
Lorem ipsum dolor sit amet
consectetur adipiscing elit. Nam a pellentesque mi. 
>piece_2 
Integer dignissim ultrices eros a consequat. Praesent vestibulum
>piece_3 
Morbi eget sollicitudin mauris. Nunc varius felis 
vitae dui congue hendrerit. Nam semper venenatis auctor.  
Suspendisse potenti. Suspendisse facilisis velit vel convallis 
fringilla. Duis condimentum auctor mauris eu lobortis. 

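I want to build a dictionary from the text above that holds each separate block of text, keyed by >piece_1 and so on.

So far I have managed to populate the dictionary with all the keys, but I don't know how to extract the text from the file.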

f = open('Output.txt', 'r')
mydict = dict()

for index, line in enumerate(f):
    if line[:1]=='>':
        mydict[index] = line #instead, the key should be line with the value being the relative text.
        print(line, end='')

4 Answers:

Answer 0 (score: 3)

I would suggest using Biopython. It is more robust and more concise than writing your own parser:

>>> from Bio import SeqIO
>>> d = SeqIO.to_dict(SeqIO.parse('input.fa', 'fasta'))

For your data:

>>> d['piece_1']
SeqRecord(seq=Seq('Loremipsumdolorsitametconsecteturadipiscingelit.Namape...mi.', SingleLetterAlphabet()), id='piece_1', name='piece_1', description='piece_1', dbxrefs=[])
>>> str(d['piece_1'].seq)
'Loremipsumdolorsitametconsecteturadipiscingelit.Namapellentesquemi.'

Answer 1 (score: 1)

You can use collections.defaultdict:

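from collections import defaultdict
# `text` is assumed to be an iterable of lines without trailing newlines,
# e.g. open('Output.txt').read().splitlines()
result = defaultdict(list)
index = None
for line in text:
    if line.startswith(">"):
        index = line[1:]
    else:
        result[index].append(line)

Output:

{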
    "piece_1 ": [
        "Lorem ipsum dolor sit amet",
        "consectetur adipiscing elit. Nam a pellentesque mi. ",
    ],
    "piece_2 ": [
        "Integer dignissim ultrices eros a consequat. Praesent vestibulum"
    ],
    "piece_3 ": [
        "Morbi eget sollicitudin mauris. Nunc varius felis ",
        "vitae dui congue hendrerit. Nam semper venenatis auctor.  ",
        "Suspendisse potenti. Suspendisse facilisis velit vel convallis ",
        "fringilla. Duis condimentum auctor mauris eu lobortis.",
    ],
}
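
Since the question asks for one string per key rather than a list of lines, the lists can be joined once grouping is done. The following is only a minimal sketch of that idea: the names `sample`, `parse_blocks` and `records` are made up for illustration, the input is a shortened copy of the question's data, and joining with a single space is an assumption about the desired result.

```python
from collections import defaultdict

# Shortened sample based on the question's data (illustrative only).
sample = """>piece_1
Lorem ipsum dolor sit amet
consectetur adipiscing elit.
>piece_2
Integer dignissim ultrices eros a consequat.
"""

def parse_blocks(lines):
    """Group lines under the most recent '>' header, then join each group."""
    groups = defaultdict(list)
    key = None
    for line in lines:
        line = line.rstrip()
        if line.startswith(">"):
            # Drop the '>' marker and surrounding whitespace from the key.
            key = line[1:].strip()
        elif key is not None and line:
            groups[key].append(line)
    # Assumption: a single space is an acceptable separator between lines.
    return {k: " ".join(v) for k, v in groups.items()}

records = parse_blocks(sample.splitlines())
print(records["piece_1"])
```

Note that in this sketch a header with no text after it does not end up in the result at all, which may or may not be what you want.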

Answer 2 (score: 1)

Here is one way to do it using simple iteration.

For example:

result = []
with open(filename) as infile:
    for line in infile:
        if line.startswith(">"):             #Check if line starts with '>'
            result.append([line, []])        #Create new list with format --> [key, [list of corresponding text]]
        else:
            result[-1][1].append(line)       #Append text to previously found key. 

mydict ={k: "".join(v) for k, v in result}   #Form required dictionary. 
print(mydict)

Output:

{'>piece_1 \n': 'Lorem ipsum dolor sit amet\nconsectetur adipiscing elit. Nam a pellentesque mi. \n',
 '>piece_2 \n': 'Integer dignissim ultrices eros a consequat. Praesent vestibulum\n',
 '>piece_3 \n': 'Morbi eget sollicitudin mauris. Nunc varius felis \nvitae dui congue hendrerit. Nam semper venenatis auctor.  \nSuspendisse potenti. Suspendisse facilisis velit vel convallis \nfringilla. Duis condimentum auctor mauris eu lobortis. '}
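
Two things about the approach above may matter in practice: the keys keep the leading '>' and the trailing newline, and `result[-1]` raises an IndexError if the file has any line before the first header. A hedged variant that sidesteps both could look like the sketch below. `read_pieces`, `demo` and `pieces` are hypothetical names, and `io.StringIO` merely stands in for an open file.

```python
import io

def read_pieces(handle):
    """Map each '>' header (marker and whitespace stripped) to its text."""
    result = {}
    key = None
    for line in handle:
        if line.startswith(">"):
            key = line[1:].strip()
            result[key] = ""
        elif key is not None:
            result[key] += line
        # Lines that appear before the first header are silently skipped.
    # Trim leading/trailing whitespace of each block; inner newlines are kept.
    return {k: v.strip() for k, v in result.items()}

# io.StringIO stands in for an open file here.
demo = io.StringIO(
    "stray line\n>piece_1 \nLorem ipsum\ndolor sit amet\n>piece_2 \nInteger dignissim\n"
)
pieces = read_pieces(demo)
print(pieces)
```

Whether stray leading lines should be skipped or reported as an error depends on how much you trust the input. Skipping is just the choice made in this sketch.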

Answer 3 (score: 0)

Here is another compact possibility using list and dictionary comprehensions:

with open('Output.txt', 'r') as f:
    s = f.read()
result = {k.strip(): v for k, v in [part.split('\n', maxsplit=1)
                                    for part in s.split('>')[1:]] }

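In the inner list comprehension, the element at index 0 of the list returned by s.split('>') is an empty string, so we skip it. Passing maxsplit=1 to the subsequent split on \n prevents each part from being split into more than two pieces.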
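
To make the two split steps concrete, here is a small illustration on a made-up string in the same shape as the question's file. The variable names are for demonstration only.

```python
s = ">piece_1 \nLorem ipsum\ndolor sit amet\n>piece_2 \nInteger dignissim\n"

parts = s.split(">")
# Nothing precedes the first '>', so the first element is the empty string.
print(repr(parts[0]))

# With maxsplit=1 only the first newline is used: header on the left, rest on the right.
header, body = parts[1].split("\n", maxsplit=1)
print(repr(header))
print(repr(body))

# Without maxsplit the same part breaks into more than two items,
# so unpacking into (k, v) would fail.
unlimited = parts[1].split("\n")
print(len(unlimited))
```

One caveat that follows from this: a '>' character inside the text itself would also be treated as a separator, so this approach assumes '>' only ever appears at the start of a header line.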