Question

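Given a DNA sequence made up of codons, I want to get the percentage of codons that start with A or T.

The DNA sequence looks like: dna = "atgagtgaaagttaacgt". Each codon starts at position 0, 3, 6, and so on, and for what I am trying to do, that is the root of the problem.

Here is what I wrote, and it works: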

import re

DNA = "atgagtgaaagttaacgt"

def atPct(dna):
    '''
    Takes a DNA sequence and returns the percentage
    of codons that start with a or t.
    '''
    # [a|t][a|t|c|g]{2} won't necessarily match only at positions where pos % 3 == 0
    numOfCodons = re.findall(r'[a|t|c|g]{3}', dna)
    count = 0
    for x in numOfCodons:
        if x[0] == 'a' or x[0] == 't':
            count += 1
            print(x)

    return 100 * count / len(numOfCodons)

print(atPct(DNA))

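My goal is to compute this without the loop. Somehow I feel there is a more elegant way to do it using only a regular expression, but I may be wrong; if there is a better way, I would be glad to learn it! Is there a way to combine the position constraint with "[a|t][a|t|c|g]{2}" in a single regex?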

P.S. The question assumes the input is a valid DNA sequence, which is why I do not even check for that.
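To make the framing problem concrete, here is a small sketch of what happens when the pattern from the question is searched without any position constraint. The pattern is written as `[at][atcg]{2}` (the same intent without the literal `|` characters), and the names `starts` and `off_frame` are illustrative only.

```python
import re

dna = "atgagtgaaagttaacgt"

# Unanchored search: after a position fails, the scan advances by one
# character, so later matches can begin outside the codon frame.
starts = [m.start() for m in re.finditer(r'[at][atcg]{2}', dna)]

# Keep only the match positions that are not multiples of 3.
off_frame = [s for s in starts if s % 3 != 0]

print(starts)
print(off_frame)
```

On this input the matches begin at 0, 3, 7, 11 and 14, so three of the five are not real codons.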

Answer 1

A loop will be faster than doing it any other way. You can still use sum with a generator expression (another SO answer) to improve readability:

import re

def atPct(dna):
    # Find all codons (use the dna argument, not the global DNA)
    numSeqs = re.findall(r'[atgc]{3}', dna)

    # Count all codons that start with 'a' or 't'
    atSeqs = sum(1 for seq in numSeqs if re.match(r'[at]', seq))

    # Return the percentage: matching codons over total codons
    return 100 * atSeqs / len(numSeqs)

DNA = "atgagtgaaagttaacgt"
print( atPct(DNA) )
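For the regex-only direction the question asks about, one possible sketch uses an alternation in which every match consumes exactly three characters, so the scan stays in frame. This assumes the input contains only a, t, c and g, as the question states; `at_pct_regex` is a hypothetical name, and nothing here suggests it is faster than the version above.

```python
import re

def at_pct_regex(dna):
    # Hypothetical helper, assuming dna contains only a/t/c/g.
    # Both alternatives consume three characters, so matches stay aligned
    # to positions 0, 3, 6, ... Group 1 is filled only when the codon
    # starts with 'a' or 't'; otherwise findall yields '' for that codon.
    firsts = re.findall(r'([at])[atcg]{2}|[atcg]{3}', dna)
    if not firsts:
        return 0.0
    hits = len(firsts) - firsts.count('')
    return 100 * hits / len(firsts)

print(at_pct_regex("atgagtgaaagttaacgt"))
```

There is still a counting step, but no explicit loop over the codons.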

Answer 2

So you just want the percentage of three-character groups in the string whose first character is a or t? Use the step argument of slicing:

def atPct(dna):
    starts = dna[::3]     # Every third character of dna, starting with the first
    # Multiply by 100 to return a percentage rather than a fraction
    return 100 * (starts.count('a') + starts.count('t')) / len(starts)
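If the length of the input might not be a multiple of three, the slice can be bounded so that a trailing partial codon is not counted. The following is a minimal sketch under that assumption; `at_pct_sliced` is a hypothetical name, and returning 0.0 when there is no complete codon is an arbitrary choice.

```python
def at_pct_sliced(dna):
    # Hypothetical variant: ignore trailing bases that do not form a full codon.
    usable = len(dna) - len(dna) % 3
    starts = dna[:usable:3]   # first base of each complete codon
    if not starts:
        return 0.0            # no complete codon in the input
    return 100 * (starts.count('a') + starts.count('t')) / len(starts)

print(at_pct_sliced("atgagtgaaagttaacgt"))
```

For the example sequence from the question, four of the six codons start with a or t, so this prints roughly 66.67.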
