Question

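I'm just taking my first steps in trying to learn a bit of Python. I'm currently working through the Rosalind online course, which aims to teach Python skills for bioinformatics. (By the way, see: rosalind.info)

I'm struggling with one particular problem. I have a file in FASTA format that looks like this: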

>Sequence_Header_1
ACGTACGTACGTACGTACGT
ACGTACGTACGTACGTACGT
>Sequence_Header_2
ACGTACGTACGTACGTACGT
ACGTACGTACGTACGTACGT

I need to calculate the percentage of G and C in each entry of the file (excluding the headers) and return that number, e.g.:

>Sequence_Header_1
48.75%
>Sequence_header_2
52.43%

My code so far is:

file = open("input.txt" , "r")
for line in file:
    if line.startswith(">"):
        print(line.rstrip())        
    else:
        print ('%3.2f' % (line.count('G')+line.count('C')/len(line)*100))
file.close()

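It does almost what I need. I'm just running into trouble when the sequence data spans multiple lines. At the moment I get the %GC content for every line in the file, rather than a single number for each entry, e.g.: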

>Sequence_Header_1
48.75%
52.65%
>Sequence_header_2
52.43%
50.25%

How can I apply my formula to data that spans multiple lines?

Thanks in advance,

Answer 1

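Not a direct answer to your question, but I think this is a better approach! If you plan to do more bioinformatics in Python, take a look at Biopython. It will handle FASTA files and other common sequence operations for you (and much more!).

So, for example: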

from Bio import SeqIO
# Newer Biopython releases replace GC with gc_fraction (which returns a 0-1 fraction)
from Bio.SeqUtils import GC

for rec in SeqIO.parse("input.txt", "fasta"):
    print(rec.id, GC(rec.seq))

Answer 2

I think this is what you're looking for:

# Read the input file
with open("input.txt", "r") as f:
    s = f.read()

# Loop through the Sequences
for i, b in enumerate(s.split("Sequence_Header_")):
    if not b: continue # At least the first one is empty 
                       # because there is no data before the first sequence
    # Join the lines with data
    data = "".join(b.split("\n")[1:])

    # Print the result
    print("Sequence_Header_{i}\n{p:.2f}%".format(
        i=i, p=(data.count('G')+data.count('C'))/len(data)*100))

Note: I couldn't find the '>' sign in your example. If your headers do start with > you can change the code to s.split(">") and it should still work.
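
As a rough sketch of the s.split(">") variant mentioned in the note, assuming Python 3 (so that / is true division) and that every header line starts with ">". The function name gc_by_record and the inline sample text are made up for illustration:

```python
def gc_by_record(text):
    """Return a list of (header, GC percent) pairs for FASTA-formatted text."""
    results = []
    for block in text.split(">"):
        if not block.strip():
            continue  # nothing precedes the first ">"
        lines = block.splitlines()
        header = lines[0].strip()
        # Join the remaining lines so a multi-line sequence is counted once
        data = "".join(line.strip() for line in lines[1:])
        if not data:
            continue  # header without any sequence data
        pct = (data.count("G") + data.count("C")) / len(data) * 100
        results.append((header, pct))
    return results


# Hypothetical sample data; a real run would pass in the contents of input.txt
sample = ">Sequence_Header_1\nACGT\nGGCC\n>Sequence_Header_2\nAATT\nAATG\n"
for header, pct in gc_by_record(sample):
    print(">{}\n{:.2f}%".format(header, pct))
```

Taking the header from the first line of each block, rather than rebuilding it from a counter, keeps the output correct when the headers are not numbered consecutively.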

Answer 3

Try keeping a running count, then reset it whenever you find a new header.

count = 0.0
line_length = 0.0
header = None
for line in open("input.txt", "r"):  # reads one line at a time, which saves memory
    if line.startswith('>'):
        if header is not None and line_length > 0.0:
            # a new header closes the previous entry, so report it
            print(header)
            print('%3.2f' % (100.0 * count / line_length))
        # start counting afresh for the new entry
        header = line.strip()
        count = 0.0
        line_length = 0.0
    else:
        count += line.count('G') + line.count('C')
        line_length += len(line.strip())
# report the final entry, which has no header after it
if header is not None and line_length > 0.0:
    print(header)
    print('%3.2f' % (100.0 * count / line_length))

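Also be careful with division in Python. In Python 2, dividing two integers with / performs integer division, i.e. 5/2 = 2. You can avoid this by making one of the operands a float, either with a decimal literal or with float(). In Python 3, / always performs true division.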
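
A quick illustration of the division behavior described here. The first result assumes Python 3; under Python 2 the same 5 / 2 would give 2, while the float() form gives 2.5 on both:

```python
# Python 3: "/" is true division, "//" is floor division
true_div = 5 / 2
floor_div = 5 // 2
# Converting one operand to float gives 2.5 on Python 2 and Python 3 alike
float_div = float(5) / 2

print(true_div, floor_div, float_div)
```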

Edit: improved it; it should also be line_length += len(line.strip()), so that the trailing newline "\n" is not counted as part of the sequence.

Answer 4

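It may not be possible to hold the whole file in memory. Assuming you can't do s = f.read() all at once, you need to keep running counts of the G/C letters and of the total letters until a new sequence starts. Something like this: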

file = open("input.txt", "r")
# for keeping count:
char_count = 0
char_total = 0
for line in file:
    if line.startswith(">"):
        if char_total != 0:
            # starting a new sequence, so calculate pct for last sequence
            char_pct = (float(char_count) / char_total) * 100
            print('%3.2f' % char_pct)
            # reset the count for the next sequence
            char_total = 0
            char_count = 0
        print(line.rstrip())
    else:
        # continuing a sequence, so add to running counts
        char_count += line.count('G') + line.count('C')
        char_total += len(line.strip())  # strip so the newline is not counted
file.close()
# the last sequence has no header after it, so report it here
if char_total != 0:
    print('%3.2f' % ((float(char_count) / char_total) * 100))

Answer 5

You can parse the FASTA format and create a dictionary with the > IDs as keys and the sequences as values, like this:

    from collections import defaultdict

    def parse_fasta(dataset):
        "Parses data in FASTA format, returning a dictionary of ID's and values"
        records = defaultdict(str)
        record_id = None
        for line in [l.strip() for l in dataset.splitlines()]:
            if line.startswith('>'):
                record_id = line[1:]
            else:
                records[record_id] += line
        return records

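Or you could rewrite this code slightly to create tuples/lists instead. I prefer a dictionary because it is already indexed. If you still need help, you can find me on the Rosalind site.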
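
One possible reading of the tuple-based variant is a generator that yields (record_id, sequence) pairs, which can then be fed straight into the GC calculation. The names iter_fasta and gc_percent and the sample lines are illustrative, not part of the answer above:

```python
def iter_fasta(lines):
    """Yield (record_id, sequence) tuples from an iterable of FASTA lines."""
    record_id, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            if record_id is not None:
                # A new header closes the previous record
                yield record_id, "".join(chunks)
            record_id, chunks = line[1:], []
        elif line and record_id is not None:
            chunks.append(line)
    if record_id is not None:
        # The last record has no header after it
        yield record_id, "".join(chunks)


def gc_percent(sequence):
    """Return the percentage of G and C characters, or 0.0 for an empty sequence."""
    if not sequence:
        return 0.0
    return 100.0 * (sequence.count("G") + sequence.count("C")) / len(sequence)


# Hypothetical sample lines, shaped like what iterating over a file would give
sample = [">Sequence_Header_1\n", "ACGT\n", "GGCC\n", ">Sequence_Header_2\n", "AATT\n"]
for record_id, sequence in iter_fasta(sample):
    print(">" + record_id)
    print("%.2f%%" % gc_percent(sequence))
```

Because iter_fasta accepts any iterable of lines, an open file object could be passed in place of the sample list, so only one record is held in memory at a time.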

Applying a formula to data that spans multiple lines
