Question

If I have a longer string, how can I calculate the probability of finding a word of a given length in that string?

So far, I have this:

import math
from scipy import stats

alphabet = list("ATCG")  # This is the alphabet I am working with
string = "AATCAGTAGATCG"  # Here are two example strings
string2 = "TGTAAACCTTGGTTTATCG"
word = "ATCG"  # This is my word

n_substrings = len(string) - len(word)  # The number of possible substrings
n_substrings2 = len(string2) - len(word)

prob_match = math.pow(len(alphabet), - len(word))  # The probability of randomly choosing the word from the alphabet

# Get the probability from a binomial test?
print stats.binom_test(1, n_substrings, p=prob_match)  # (Number of successes, number of trials, prob of success)
print stats.binom_test(1, n_substrings2, p=prob_match)

>>>0.0346119111615
   0.0570183821615

Is this a suitable approach, or am I missing something?

Answer 1

I think you should do this:

n_substrings = len(string) - len(word) +1

A 5-letter string contains 2 possible 4-letter substrings, not 1: ATCGA contains both ATCG and TCGA.
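As a rough check of the counting, here is a minimal Python 3 sketch. The helper names (n_windows, substrings, approx_prob_at_least_one) are made up for illustration and do not come from the question. The last function assumes each window matches independently with probability len(alphabet) ** -len(word). Overlapping windows are not truly independent, so treat its result as an approximation only.

```python
def n_windows(text_len, word_len):
    # Number of start positions for a substring of length word_len
    # (zero if the word is longer than the text).
    return max(text_len - word_len + 1, 0)


def substrings(text, k):
    # List every length-k substring, one per start position.
    return [text[i:i + k] for i in range(n_windows(len(text), k))]


def approx_prob_at_least_one(text_len, word_len, alphabet_size):
    # Chance that a single window equals the word, assuming uniform letters.
    p = alphabet_size ** -word_len
    n = n_windows(text_len, word_len)
    # Approximation: treats the n windows as independent trials.
    return 1 - (1 - p) ** n


print(substrings("ATCGA", 4))              # ['ATCG', 'TCGA']
print(n_windows(len("AATCAGTAGATCG"), 4))  # 10
print(approx_prob_at_least_one(len("AATCAGTAGATCG"), 4, 4))
```

With the corrected count, the 13-letter example string has 10 candidate positions rather than 9, which slightly raises the estimated probability.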
