Probability of finding a word in a string

Time: 2016-09-28 14:29:29

Tags: python statistics

Given a longer string, how can I calculate the probability of finding a word of a given length somewhere in that string?

So far, I have this:

import math
from scipy import stats

alphabet = list("ATCG")  # This is the alphabet I am working with
string = "AATCAGTAGATCG"  # Here are two example strings
string2 = "TGTAAACCTTGGTTTATCG"
word = "ATCG"  # This is my word

n_substrings = len(string) - len(word)  # The number of possible substrings
n_substrings2 = len(string2) - len(word)

prob_match = math.pow(len(alphabet), - len(word))  # The probability of randomly choosing the word from the alphabet

# Get the probability from a binomial test?
print stats.binom_test(1, n_substrings, p=prob_match)  # (Number of successes, number of trials, prob of success)
print stats.binom_test(1, n_substrings2, p=prob_match)

>>>0.0346119111615
   0.0570183821615

Is this an appropriate approach, or am I missing something?

1 answer:

Answer 0 (score: 1):

I think you should do this instead:

n_substrings = len(string) - len(word) + 1

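A 5-letter string contains two 4-letter substrings, not one: ATCGA contains both ATCG and TCGA. In general, a string of length L has L - k + 1 substrings of length k.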
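
To make the counting concrete, here is a minimal sketch (not part of the original answer) that lists every window of a given length and checks the count against len(string) - len(word) + 1. The helper names windows and approx_prob_at_least_one are hypothetical. The second helper estimates the chance of at least one match as 1 - (1 - p)^n, which assumes uniformly random letters and treats the windows as independent; overlapping windows are not truly independent, so this is only an approximation.

```python
def windows(text, k):
    """Return every length-k substring (window) of text, left to right."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]


def approx_prob_at_least_one(text_len, word_len, alphabet_size):
    """Approximate P(word occurs at least once) in a random string.

    Assumption: letters are uniform and windows are treated as
    independent, which overlapping windows are not.
    """
    n = text_len - word_len + 1            # corrected number of windows
    if n <= 0:
        return 0.0                         # the word cannot fit at all
    p = float(alphabet_size) ** -word_len  # chance a single window matches
    return 1.0 - (1.0 - p) ** n


print(windows("ATCGA", 4))                # ['ATCG', 'TCGA']
print(len(windows("AATCAGTAGATCG", 4)))   # 10, i.e. 13 - 4 + 1
print(approx_prob_at_least_one(13, 4, 4))
```

With the question's first example string (13 letters) and a 4-letter word, this gives 10 windows rather than the 9 that the original n_substrings line computes.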