Python遍历文件并从多行生成列表

时间:2019-06-27 16:41:41

标签: python list file loops

所以我有一个DNA fasta文件。其格式如下:(Input Image

  

Rosalind_8728   ATGGAGCCGCACATATAACGGTAAATGCAAAGAAACAGTTCGGGAAAGATATTCAACCAA   GGCAACTTCCTGCACTCGTCGCGGGCACGTAGGGAGCCTGACCATCCCTACCCAGACTGT   CCCCGATCAGCGAACGGGCCATGCGCTATCAGGTCGATCTAGCACTTGGTAAGTTACGCC   AGCTGTACTGAAACAATGCCCGTAGTGACTGAGACGCCAGGGAAAAGGGGATTAAAGCTA   TGGTAGCCAATCGTCCTAACCTCTAGCCCGCCTGGTATGTAAGAACAAGACATCAGAGAT   ATAGAGGCAGACCGGACCTGCAAGCCGGTCACCTGTGGCTCCCGACAAATGTGGCGTTTA   GCTTATGCAAGACCGAAGCTTAGAACCAAGTCGGCTTCGTACCCCTTCTTACCTGTCCAC   TGCAGTGTTTTGCCTGGATCCGGGTGCGCGTGGCACGAGATCTGCTGAGAAGCTATGAAC   AATCAAATGTGTAGCCCGCTTACGAAGAATCCAGCCCTGAATTCGGGGGCCAGTCTTCGC   CGAACTCCCCCTATTGAGTGGTAAAGTGTGTGACTCCTAGTCTTTTCACCCGAGTCGTTGTG   AATTGTTAGGCTACAGATTTCGCATAGCCCTGATCCAAGCCTTTCTCTGAAAAGATGCGA   CCTGCATCACTAAGGCCAACCGTGTGTCTCTCCGACATTACGGCAGTGCCACTGATCGCT   CACGAACTTGGGAAGCCCCAAAAACTCACATGAGTATGTAGGGCAGTTTTATAGGCTGGG   CCCACCCACTTGGTTAGCAAATGGCGCCTGCTCAGAACTCCTTTTACGTAAGTGGTCCCA   GTGTGATGGGTCGAGTGAACAAACAAATGTTGACAATTTGCCTCGGGGTTA

     

Rosalind_6085   CGGATCTGCGTACGGTTGCGTATCCCGTTCAAATGCTCCATCACTCATCACGGAGCCACG   TTCCGACCTGCCCACATCTGCGTCTAATACCACGCCAGTACTTACCACGCCGCTGGGTCT   TCGAGAACGAGGCTGAATGGGTTTCCGGGGGTGGGAAAGTAATACAAGCGTCATTCGTGA   ACTGGGACCATGTCATCTGGCGAAGCTATAGTGCGATCGAACTAAACGCTAATACGTCGA   AACAGTCTATGGCCGTGAACTTTCTCTAGAGGGTAGGGTTCTTAGCCCCGCCTATTACTT   GAACGGATATCAAAGACAGACTTAGCATCTCTGTACCCGCCCTACTGTTGCTTCAAAAGTCA   TGCGGAGATTTGTGGGAGCTTGGTCACCTATCGGGCACATCCAGAATGGTCTTTCTCGTA   GGTTGAAACAGCCGGGATGCACGTGTGTTTTGTAGGCAAATATAGTGTTTCCGGTGCTAA   CTAGATTGAGGCAACTCCTATGCCAGAGCATACGGATAGAGACCGAATTGTTTATATGTGTG   CGTTTACCCGATCAGATGCAGTACTTTGGTGGGCAATTTTAGTGAATTGCTCACGTGTTT   TAATAACCGGTCCAAGGTTACCTCCCGCCACGTCATAGAGAAATGGGGGAGTATAGAGAG   GTAGCTTCTTTCCACACTTGCTTCGAAAAGTGGCCCTCCCTAGGCCACTCCAGATCACTT   CCCTCGCAGCCGATACTTTAAATCTGTTCTCGACTGGTTTAACGTTTTGAGCGAGATTGT   GCAGGTCTATCGTCGAGTTTTAGGAGAAACCGTGGCTGTCTCAAACCGGTAGCGACCAAG   TAACTTGTGTGGTGTGTGGCGCGTACCCCTTTTCCTTTCCGACAACACTGTACCCCTAGATA   TAGTGGAATCAGTGAATCAAGATCTACCGGGAATAGACACTCGCTTGAGAAAACATTTCC

我最终想看一下这些Rosalind_ids中具有最大G和C的DNA。因此,我的思考过程是列出id标签,然后列出与之相关的所有dna。然后将它们压缩到字典中,并创建一个函数来确定最高GC字母浓度。问题是当我追加多个字母的行时,我得到一个列表,其中每行用','分隔,而不是包含在rosalind_id_tag下面的ALL行的1个列表,然后如果它们是新标签,则用','分隔它们。 / p>

所以最终我想要:

dna = [list of letters from first random_id, list of letters from second_random_id, ...] 

而不是我得到的是什么

dna = [this is first line, this is second line, this is third line,..] 

我尝试了扩展,但这似乎不起作用。

我曾尝试制作嵌套列表,并将它们附加到我的主要DNA列表中

到目前为止(有效)的代码是:

file = open("rosalind_gc.txt", "r")


data = file.readlines()

rosalindtags = []

dna = []


for a in data:

    if a.startswith(">"):

        rosalindtags.append(a.rstrip())

    else:

       dna.append(a.rstrip())

dictionary = dict(zip(rosalindtags, dna))

file.close()

我知道我缺少一些琐碎的东西,但我只是不知道它是什么。感谢您的任何帮助,谢谢!

3 个答案:

答案 0 :(得分:1)

问题在于,每个id都有一行,而DNA却有多行。 创建rosalindtag时,可以在dna后面附加一个空字符串。遇到DNA线时,可以将其添加到dna的最后一个元素中:

file = open("rosalind_gc.txt", "r")
data = file.readlines()
rosalindtags = []
dna = []

for a in data:
    if a.startswith(">"):
        rosalindtags.append(a.rstrip())
        dna.append('')
    else:
        dna[-1] = dna[-1] + a.rstrip()

dictionary = dict(zip(rosalindtags, dna))
file.close()

dictionary然后是:

{'>Rosalind_8728': 'ATGGAGCCGCACATATAACGGTAAATGCAAAGAAACAGTTCGGGAAAGATATTCAACCAAGGCAACTTCCTGCACTCGTCGCGGGCACGTAGGGAGCCTGACCATCCCTACCCAGACTGTCCCCGATCAGCGAACGGGCCATGCGCTATCAGGTCGATCTAGCACTTGGTAAGTTACGCCAGCTGTACTGAAACAATGCCCGTAGTGACTGAGACGCCAGGGAAAAGGGGATTAAAGCTATGGTAGCCAATCGTCCTAACCTCTAGCCCGCCTGGTATGTAAGAACAAGACATCAGAGATATAGAGGCAGACCGGACCTGCAAGCCGGTCACCTGTGGCTCCCGACAAATGTGGCGTTTAGCTTATGCAAGACCGAAGCTTAGAACCAAGTCGGCTTCGTACCCCTTCTTACCTGTCCACTGCAGTGTTTTGCCTGGATCCGGGTGCGCGTGGCACGAGATCTGCTGAGAAGCTATGAACAATCAAATGTGTAGCCCGCTTACGAAGAATCCAGCCCTGAATTCGGGGGCCAGTCTTCGCCGAACTCCCCCTATTGAGTGGTAAAGTGTGTGACTCCTAGTCTTTTCACCCGAGTCGTTGAATTGTTAGGCTACAGATTTCGCATAGCCCTGATCCAAGCCTTTCTCTGAAAAGATGCGACCTGCATCACTAAGGCCAACCGTGTGTCTCTCCGACATTACGGCAGTGCCACTGATCGCTCACGAACTTGGGAAGCCCCAAAAACTCACATGAGTATGTAGGGCAGTTTTATAGGCTGGGCCCACCCACTTGGTTAGCAAATGGCGCCTGCTCAGAACTCCTTTTACGTAAGTGGTCCCAGTGTGATGGGTCGAGTGAACAAACAAATGTTGACAATTTGCCTCGGGGTTA',
 '>Rosalind_6085': 'CGGATCTGCGTACGGTTGCGTATCCCGTTCAAATGCTCCATCACTCATCACGGAGCCACGTTCCGACCTGCCCACATCTGCGTCTAATACCACGCCAGTACTTACCACGCCGCTGGGTCTTCGAGAACGAGGCTGAATGGGTTTCCGGGGGTGGGAAAGTAATACAAGCGTCATTCGTGAACTGGGACCATGTCATCTGGCGAAGCTATAGTGCGATCGAACTAAACGCTAATACGTCGAAACAGTCTATGGCCGTGAACTTTCTCTAGAGGGTAGGGTTCTTAGCCCCGCCTATTACTTGAACGGATATCAAAGACAGACTTAGCATCTCTGTACCCGCCCTACTGTTGCTTCAAGTCATGCGGAGATTTGTGGGAGCTTGGTCACCTATCGGGCACATCCAGAATGGTCTTTCTCGTAGGTTGAAACAGCCGGGATGCACGTGTGTTTTGTAGGCAAATATAGTGTTTCCGGTGCTAACTAGATTGAGGCAACTCCTATGCCAGAGCATACGGATAGAGACCGAATTGTTTATATGTGCGTTTACCCGATCAGATGCAGTACTTTGGTGGGCAATTTTAGTGAATTGCTCACGTGTTTTAATAACCGGTCCAAGGTTACCTCCCGCCACGTCATAGAGAAATGGGGGAGTATAGAGAGGTAGCTTCTTTCCACACTTGCTTCGAAAAGTGGCCCTCCCTAGGCCACTCCAGATCACTTCCCTCGCAGCCGATACTTTAAATCTGTTCTCGACTGGTTTAACGTTTTGAGCGAGATTGTGCAGGTCTATCGTCGAGTTTTAGGAGAAACCGTGGCTGTCTCAAACCGGTAGCGACCAAGTAACTTGTGTGGTGTGGCGCGTACCCCTTTTCCTTTCCGACAACACTGTACCCCTAGATATAGTGGAATCAGTGAATCAAGATCTACCGGGAATAGACACTCGCTTGAGAAAACATTTCCTC'}

请注意,如果您读取大文件,则此方法将需要大量内存。

这是一种逐行读取文件并且仅将字母计数保留在内存中的替代方法:

from collections import Counter

rosalin_id = None
dna = {}

with open("rosalind_gc.txt") as rosalin_f:
    for line in rosalin_f:
        if line.startswith(">"):
            rosalin_id = line.rstrip()
            dna[rosalin_id] = Counter()
        else:
            dna[rosalin_id] += Counter(line.rstrip())

dna

它返回:

{'>Rosalind_8728': Counter({'A': 228, 'T': 202, 'G': 225, 'C': 236}),
 '>Rosalind_6085': Counter({'C': 236, 'G': 237, 'A': 231, 'T': 258})}

答案 1 :(得分:1)

对于基于生物信息学的方法,您还可以尝试下载具有可直接从SwissProt读取fasta文件的扩展名的biopython。

from Bio import SeqIO, ExPASy

protein_name = "Rosalind_6085"
with ExPASy.get_sprot_raw(protein_cleaned) as handle:
     seq_record = SeqIO.read(handle, "swiss")

proteinseq = seq_record.seq

从这里开始,proteineqeq将是一个可以与其他字符串进行比较的字符串。

答案 2 :(得分:0)

好吧,我们假设我们从文件中读取了该结果并将结果存储在data中,这样我们就可以看到我们正在处理的内容:

data = """>Rosalind_8728
ATGGAGCCGCACATATAACGGTAAATGCAAAGAAACAGTTCGGGAAAGATATTCAACCAA
GGCAACTTCCTGCACTCGTCGCGGGCACGTAGGGAGCCTGACCATCCCTACCCAGACTGT
CCCCGATCAGCGAACGGGCCATGCGCTATCAGGTCGATCTAGCACTTGGTAAGTTACGCC
AGCTGTACTGAAACAATGCCCGTAGTGACTGAGACGCCAGGGAAAAGGGGATTAAAGCTA
TGGTAGCCAATCGTCCTAACCTCTAGCCCGCCTGGTATGTAAGAACAAGACATCAGAGAT
ATAGAGGCAGACCGGACCTGCAAGCCGGTCACCTGTGGCTCCCGACAAATGTGGCGTTTA
GCTTATGCAAGACCGAAGCTTAGAACCAAGTCGGCTTCGTACCCCTTCTTACCTGTCCAC
TGCAGTGTTTTGCCTGGATCCGGGTGCGCGTGGCACGAGATCTGCTGAGAAGCTATGAAC
AATCAAATGTGTAGCCCGCTTACGAAGAATCCAGCCCTGAATTCGGGGGCCAGTCTTCGC
CGAACTCCCCCTATTGAGTGGTAAAGTGTGTGACTCCTAGTCTTTTCACCCGAGTCGTTG
AATTGTTAGGCTACAGATTTCGCATAGCCCTGATCCAAGCCTTTCTCTGAAAAGATGCGA
CCTGCATCACTAAGGCCAACCGTGTGTCTCTCCGACATTACGGCAGTGCCACTGATCGCT
CACGAACTTGGGAAGCCCCAAAAACTCACATGAGTATGTAGGGCAGTTTTATAGGCTGGG
CCCACCCACTTGGTTAGCAAATGGCGCCTGCTCAGAACTCCTTTTACGTAAGTGGTCCCA
GTGTGATGGGTCGAGTGAACAAACAAATGTTGACAATTTGCCTCGGGGTTA
>Rosalind_6085
CGGATCTGCGTACGGTTGCGTATCCCGTTCAAATGCTCCATCACTCATCACGGAGCCACG
TTCCGACCTGCCCACATCTGCGTCTAATACCACGCCAGTACTTACCACGCCGCTGGGTCT
TCGAGAACGAGGCTGAATGGGTTTCCGGGGGTGGGAAAGTAATACAAGCGTCATTCGTGA
ACTGGGACCATGTCATCTGGCGAAGCTATAGTGCGATCGAACTAAACGCTAATACGTCGA
AACAGTCTATGGCCGTGAACTTTCTCTAGAGGGTAGGGTTCTTAGCCCCGCCTATTACTT
GAACGGATATCAAAGACAGACTTAGCATCTCTGTACCCGCCCTACTGTTGCTTCAAGTCA
TGCGGAGATTTGTGGGAGCTTGGTCACCTATCGGGCACATCCAGAATGGTCTTTCTCGTA
GGTTGAAACAGCCGGGATGCACGTGTGTTTTGTAGGCAAATATAGTGTTTCCGGTGCTAA
CTAGATTGAGGCAACTCCTATGCCAGAGCATACGGATAGAGACCGAATTGTTTATATGTG
CGTTTACCCGATCAGATGCAGTACTTTGGTGGGCAATTTTAGTGAATTGCTCACGTGTTT
TAATAACCGGTCCAAGGTTACCTCCCGCCACGTCATAGAGAAATGGGGGAGTATAGAGAG
GTAGCTTCTTTCCACACTTGCTTCGAAAAGTGGCCCTCCCTAGGCCACTCCAGATCACTT
CCCTCGCAGCCGATACTTTAAATCTGTTCTCGACTGGTTTAACGTTTTGAGCGAGATTGT
GCAGGTCTATCGTCGAGTTTTAGGAGAAACCGTGGCTGTCTCAAACCGGTAGCGACCAAG
TAACTTGTGTGGTGTGGCGCGTACCCCTTTTCCTTTCCGACAACACTGTACCCCTAGATA
TAGTGGAATCAGTGAATCAAGATCTACCGGGAATAGACACTCGCTTGAGAAAACATTTCC
TC"""

lines = data.splitlines(False)

d = {}
n = len(lines)
i = 0
while i < n:
    line = lines[i]
    if line[0] == ">":
        id = line
        i += 1
        dna = ''
        while i < n:
            line = lines[i]
            if line[0] != '>':
                dna += line
                i += 1
            else:
                break
        d[id] = dna
    else:
        # unexpected, so skip until you find a tag
        i += 1

for k, v in d.items():
    print(k, ':', v, "\n", sep='')

输出将是(由于我用控制台固定了,所以行被分割了):

>Rosalind_8728:ATGGAGCCGCACATATAACGGTAAATGCAAAGAAACAGTTCGGGAAAGATATTCAACCAAGGCAACTTCCTGCACTCGTCGCGGGCACGTAGGGAGCCTGACCATCCCTACCCAGACTGTCCCCGATCAGCGAAC
GGGCCATGCGCTATCAGGTCGATCTAGCACTTGGTAAGTTACGCCAGCTGTACTGAAACAATGCCCGTAGTGACTGAGACGCCAGGGAAAAGGGGATTAAAGCTATGGTAGCCAATCGTCCTAACCTCTAGCCCGCCTGGTATGTAAGAA
CAAGACATCAGAGATATAGAGGCAGACCGGACCTGCAAGCCGGTCACCTGTGGCTCCCGACAAATGTGGCGTTTAGCTTATGCAAGACCGAAGCTTAGAACCAAGTCGGCTTCGTACCCCTTCTTACCTGTCCACTGCAGTGTTTTGCCT
GGATCCGGGTGCGCGTGGCACGAGATCTGCTGAGAAGCTATGAACAATCAAATGTGTAGCCCGCTTACGAAGAATCCAGCCCTGAATTCGGGGGCCAGTCTTCGCCGAACTCCCCCTATTGAGTGGTAAAGTGTGTGACTCCTAGTCTTT
TCACCCGAGTCGTTGAATTGTTAGGCTACAGATTTCGCATAGCCCTGATCCAAGCCTTTCTCTGAAAAGATGCGACCTGCATCACTAAGGCCAACCGTGTGTCTCTCCGACATTACGGCAGTGCCACTGATCGCTCACGAACTTGGGAAG
CCCCAAAAACTCACATGAGTATGTAGGGCAGTTTTATAGGCTGGGCCCACCCACTTGGTTAGCAAATGGCGCCTGCTCAGAACTCCTTTTACGTAAGTGGTCCCAGTGTGATGGGTCGAGTGAACAAACAAATGTTGACAATTTGCCTCG
GGGTTA

>Rosalind_6085:CGGATCTGCGTACGGTTGCGTATCCCGTTCAAATGCTCCATCACTCATCACGGAGCCACGTTCCGACCTGCCCACATCTGCGTCTAATACCACGCCAGTACTTACCACGCCGCTGGGTCTTCGAGAACGAGGCTG
AATGGGTTTCCGGGGGTGGGAAAGTAATACAAGCGTCATTCGTGAACTGGGACCATGTCATCTGGCGAAGCTATAGTGCGATCGAACTAAACGCTAATACGTCGAAACAGTCTATGGCCGTGAACTTTCTCTAGAGGGTAGGGTTCTTAG
CCCCGCCTATTACTTGAACGGATATCAAAGACAGACTTAGCATCTCTGTACCCGCCCTACTGTTGCTTCAAGTCATGCGGAGATTTGTGGGAGCTTGGTCACCTATCGGGCACATCCAGAATGGTCTTTCTCGTAGGTTGAAACAGCCGG
GATGCACGTGTGTTTTGTAGGCAAATATAGTGTTTCCGGTGCTAACTAGATTGAGGCAACTCCTATGCCAGAGCATACGGATAGAGACCGAATTGTTTATATGTGCGTTTACCCGATCAGATGCAGTACTTTGGTGGGCAATTTTAGTGA
ATTGCTCACGTGTTTTAATAACCGGTCCAAGGTTACCTCCCGCCACGTCATAGAGAAATGGGGGAGTATAGAGAGGTAGCTTCTTTCCACACTTGCTTCGAAAAGTGGCCCTCCCTAGGCCACTCCAGATCACTTCCCTCGCAGCCGATA
CTTTAAATCTGTTCTCGACTGGTTTAACGTTTTGAGCGAGATTGTGCAGGTCTATCGTCGAGTTTTAGGAGAAACCGTGGCTGTCTCAAACCGGTAGCGACCAAGTAACTTGTGTGGTGTGGCGCGTACCCCTTTTCCTTTCCGACAACA
CTGTACCCCTAGATATAGTGGAATCAGTGAATCAAGATCTACCGGGAATAGACACTCGCTTGAGAAAACATTTCCTC

如果需要实际计数,则在文件开头添加:

from collections import Counter

然后将d[id] = dna替换为d[id] = Counter(dna)。然后您得到:

>Rosalind_8728:Counter({'C': 236, 'A': 228, 'G': 225, 'T': 202})

>Rosalind_6085:Counter({'T': 258, 'G': 237, 'C': 236, 'A': 231})