Question

So I have this homework problem where I need to count each letter character in certain sections. Sample file:

document.querySelector('.prevent-default').addEventListener('click', (e)=>{
   e.preventDefault();
}, false);

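How can I make Python count the characters in the 4451 and 6341 sections? The file is different every time, so I can't simply hard-code the line numbers where the letters should be counted.

Also, here is my code: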

    <input type="checkbox" id="1" />
    <label for="1">foo some part <span class="prevent-default">not</span> clickable</label>

Answer 1

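You can use a dictionary to hold the IDs and their bases. If you know the IDs you want in advance, you can loop over the dictionary (by the wanted IDs) and compute the base composition for each ID.

You can use Counter from collections to count the bases of each sequence.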

from collections import Counter

d = {} # dictionary to hold fasta data

file = input('Filename: ')

with open(file, 'r') as fasta:
    for line in fasta:
        line = line.rstrip()
        if line.startswith('>'):
            id = line
            d[id] = ''
        else:
            d[id] += line

wanted = ['>Rosalind_4451', '>Rosalind_6341']

for id in wanted:
    print(id)
    seen = Counter(d[id])
    CG_com = (seen.get('G', 0) + seen.get('C', 0)) / sum(seen.values())
    print(format(CG_com, '.4f'))

For your data, the output I get is:

>Rosalind_4451
0.4912
>Rosalind_6341
0.5042

Answer 2

You can import re and use re.split to split the file into its sections, provided they all follow the same format, and then call .count() on each section.
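A minimal sketch of that idea, assuming every section starts with a header line beginning with `>` (as in the Rosalind IDs mentioned in the question). The sample text and the helper name `count_by_section` are made up for illustration and are not from the original post:

```python
import re

def count_by_section(text):
    """Split text at each '>' header and count A/C/G/T in every section."""
    counts = {}
    # Split at every '>' that begins a line; the first chunk is empty
    # when the text starts with a header, so blank chunks are skipped.
    for chunk in re.split(r'^>', text, flags=re.MULTILINE):
        if not chunk.strip():
            continue
        # First line of the chunk is the header, the rest is the sequence.
        header, _, body = chunk.partition('\n')
        counts[header.strip()] = {base: body.count(base) for base in 'ACGT'}
    return counts

# Hypothetical sample data, standing in for the real file contents.
sample = ">Rosalind_4451\nACGT\nGGCC\n>Rosalind_6341\nAATT\n"
result = count_by_section(sample)
print(result['Rosalind_4451'])
```

With a real file you would pass the result of reading the whole file as `text`; the header keys come out without the leading `>` because the split consumes it.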

Answer 3

Try using the following regular expression to decide whether a line contains a section header (let's call it the delimiter in our case):

'>\w+_\d+\n'

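This will match delimiters such as >Rosalind_4451 and >Rosalind_6341 and others in a similar format.

Whenever a match is found on a line, reset the counts of all letters to 0. Hope this helps.

P.S.: Make sure to import the regular expression module with the following statement.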

import re
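As a rough sketch of how the reset-on-delimiter approach could look, assuming the headers follow the `>Name_1234` shape described above. The function name `count_sections` and the sample lines are hypothetical, and a fresh `Counter` per header stands in for "resetting the counts to 0":

```python
import re
from collections import Counter

# Header pattern from the answer, without the trailing newline because
# each line is stripped before matching, e.g. ">Rosalind_4451".
HEADER = re.compile(r'>\w+_\d+')

def count_sections(lines):
    """Return {header: Counter of letters} for each section."""
    results = {}
    current = None
    for line in lines:
        line = line.strip()
        if HEADER.fullmatch(line):
            # New section found: start again from an empty counter.
            current = results[line] = Counter()
        elif current is not None:
            # Sequence line: add its letters to the current section.
            current.update(line)
    return results

# Hypothetical sample lines, as they would come from iterating over a file.
sample = [">Rosalind_4451\n", "ACGT\n", "GG\n", ">Rosalind_6341\n", "AT\n"]
sections = count_sections(sample)
print(sections['>Rosalind_4451']['G'])
```

Because an open file is itself an iterable of lines, it could be passed to `count_sections` directly in place of the sample list.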

Answer 4

You can modify your code slightly:

# Automatically closes file at end, good practice
with open('filename.txt', 'r') as txt:

    lines = txt.readlines()
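    # Assumes headers and sequences alternate line by line (header, sequence, ...).
    # Step through every header line in the whole file, not just the first half.
    for ii in range(0, len(lines) - 1, 2):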

        # String objects have a built-in method to see if it starts with a substring
        if lines[ii].startswith(">Rosalind_9690"):

            # Cast to float right away
            a = float(lines[ii+1].count("A"))
            g = float(lines[ii+1].count("G"))
            c = float(lines[ii+1].count("C"))
            t = float(lines[ii+1].count("T"))

            CG_con = (g+c)/(a+g+c+t)
            print (CG_con)

Answer 5

With a regular expression pattern, you can do this without iterating over every line:

import re

txt = open(input()).read()

matchObj = re.search(r'>Rosalind_4451\n([AGTC\n]+)', txt) # group 1 between ()
match = matchObj.group(1) # get group 1 of match object (AGTCGT...) as string

a = float(match.count('A'))
g = float(match.count('G'))
c = float(match.count('C'))
t = float(match.count('T'))

CG_con = (g + c) / (a + g + c + t)
print(CG_con)

You can also use an f-string to set whichever ID you need:

ID = '4451'
matchObj = re.search(rf'>Rosalind_{ID}\n([AGTC\n]+)', txt)
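Building on the f-string pattern, here is a small sketch that wraps the search in a function so it can be reused for several IDs and returns `None` when a section is missing. The function name `gc_content` and the sample text are assumptions for illustration, not part of the original answer:

```python
import re

def gc_content(txt, seq_id):
    """Return the GC fraction of the section '>Rosalind_<seq_id>', or None."""
    # re.escape guards against IDs containing regex metacharacters.
    m = re.search(rf'>Rosalind_{re.escape(seq_id)}\n([AGTC\n]+)', txt)
    if m is None:
        return None  # no such section in the text
    # Drop line breaks so only bases are counted in the total.
    seq = m.group(1).replace('\n', '')
    return (seq.count('G') + seq.count('C')) / len(seq)

# Hypothetical sample data, standing in for the real file contents.
sample = ">Rosalind_4451\nACGT\nGGCC\n>Rosalind_6341\nAATT\n"
for seq_id in ('4451', '6341', '9999'):
    print(seq_id, gc_content(sample, seq_id))
```

Checking for `None` before calling `.group(1)` avoids an `AttributeError` when the requested ID is not in the file.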

Answer 6

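There's no need to overcomplicate this: you can use re to find the number of occurrences of a pattern (i.e. a letter), and the handy findall function returns a list of all instances. Also, from your description and comments, it sounds like you want to add these values up across lines? It isn't entirely clear, but if you want to keep a running count of each letter across the lines of a file, you must make sure to add each new count to the old one.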

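import re

a = 0
g = 0
c = 0
t = 0
in_section = False
# 'filename.txt' is a placeholder for the actual input file
with open('filename.txt') as txt:
    for line in txt:
        line = line.strip()
        if line.startswith('>'):
            # header line: count only while inside the wanted section
            in_section = (line == '>Rosalind_9690')
        elif in_section:
            # adding the re.IGNORECASE flag will match lower and upper case instances
            a += len(re.findall('a', line, re.IGNORECASE))
            g += len(re.findall('g', line, re.IGNORECASE))
            c += len(re.findall('c', line, re.IGNORECASE))
            t += len(re.findall('t', line, re.IGNORECASE))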

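Using separate variables to hold the counts isn't very elegant; if we store them in a dict instead, we will probably find them easier to work with later: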

import re

# use a name that describes the content, I'm assuming the letters
# are nucleobases
nucleobase_count = {
    'a': 0,
    'g': 0,
    'c': 0,
    't': 0
}

in_section = False
# 'filename.txt' is a placeholder for the actual input file
with open('filename.txt') as txt:
    for line in txt:
        line = line.strip()
        if line.startswith('>'):
            # header line: count only while inside the wanted section
            in_section = (line == '>Rosalind_9690')
        elif in_section:
            # adding the re.IGNORECASE flag will match lower and upper
            # case instances
            for base in nucleobase_count:
                nucleobase_count[base] += len(re.findall(base, line, re.IGNORECASE))

How do I make Python read a specific section of a text and stop where I want it to

6 Answers: