I have a string
strA='ABCCBAAABBCCABABC'
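and a dictionary of lists, each shorter than strA (for example, length 4), giving the probability of each character occurring at a given position: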
{"A":[0.4, 0.5, 0.1, 0.2], "B": [0.3, 0.5, 0.3, 0.6], "C":[0.3, 0.0, 0.6, 0.2]}
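I now want to slide a window of length 4 over strA and score the whole string against this dictionary. The first 4 characters of strA are ABCC, so according to the dictionary the score is:
(p(A) at pos 0) * (p(B) at pos 1) * (p(C) at pos 2) * (p(C) at pos 3)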
= 0.4 * 0.5 * 0.6 * 0.2 = 0.024
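As a quick check of the arithmetic above, here is a minimal sketch in Python. `window_score` is a hypothetical helper name introduced for illustration, not something from the question, and `math.prod` assumes Python 3.8 or later.

```python
import math

prob_dict = {
    "A": [0.4, 0.5, 0.1, 0.2],
    "B": [0.3, 0.5, 0.3, 0.6],
    "C": [0.3, 0.0, 0.6, 0.2],
}

def window_score(window, probs):
    # Multiply the probability of each character at its position in the window.
    return math.prod(probs[ch][pos] for pos, ch in enumerate(window))

score = window_score("ABCC", prob_dict)
print(round(score, 6))  # 0.024
```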
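Now I want to slide this window along the string in steps of 1 and find the highest score over all windows of this length.
This is how I currently do it: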
def best_score(strA, prob_dict, window_len=4):
    score_new = 0
    # + 1 so that the final window is scored as well
    for a in range(len(strA) - window_len + 1):
        slide = strA[a:a + window_len]
        score = 1
        for b in range(len(slide)):
            score = score * prob_dict[slide[b]][b]
        if score > score_new:
            score_new = score
    return score_new
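This works, but it is slow. I have to score 10,000 strings of variable length against 1,000 windows of different lengths, each with its own probability dictionary.
Is there a faster way to score a string against a probability dictionary and return only the highest score?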
Answer 0 (score: 1)
Here is what I came up with:
def compute_best_score(test_string, prob_dict, window_size):
    # assertion to check for uniform lengths
    assert all(len(prob_list) == window_size for prob_list in prob_dict.values())
    best_window_index = -1
    best_window_score = 0
    for index, letter in enumerate(test_string):
        # ignore last region of test_string as a window cannot be computed
        if index > len(test_string) - window_size:
            break
        current_window_score = 1.0
        for score_index in range(window_size):
            current_char = test_string[index + score_index]
            letter_scores = prob_dict[current_char]
            current_window_score *= letter_scores[score_index]
        if current_window_score > best_window_score:
            best_window_score = current_window_score
            best_window_index = index
    return best_window_score, best_window_index
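When profiled in Python, this approach seems to perform better than the reduce-based answer (which surprised me). Here is a link to the simple profiling run I did: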
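A comparison along these lines could be sketched with the standard `timeit` module. This is not the author's original profiling run. The function names `loop_best` and `reduce_best` are hypothetical, the inputs are just the small example from the question, and the timings will vary by machine and input size, so none are shown.

```python
import timeit
from functools import reduce

strA = "ABCCBAAABBCCABABC"
prob_dict = {
    "A": [0.4, 0.5, 0.1, 0.2],
    "B": [0.3, 0.5, 0.3, 0.6],
    "C": [0.3, 0.0, 0.6, 0.2],
}
window_size = 4

def loop_best(s, probs, n):
    # Explicit nested loops over every full-length window.
    best = 0.0
    for start in range(len(s) - n + 1):
        score = 1.0
        for offset in range(n):
            score *= probs[s[start + offset]][offset]
        if score > best:
            best = score
    return best

def reduce_best(s, probs, n):
    # reduce/zip formulation of the same computation.
    windows = zip(*[s[i:] for i in range(n)])
    return max(
        reduce(lambda a, b: a * b, [probs[c][i] for i, c in enumerate(w)])
        for w in windows
    )

loop_result = loop_best(strA, prob_dict, window_size)
reduce_result = reduce_best(strA, prob_dict, window_size)

# Both formulations should agree before their timings are compared.
assert abs(loop_result - reduce_result) < 1e-12

for name, func in (("loop", loop_best), ("reduce", reduce_best)):
    seconds = timeit.timeit(lambda: func(strA, prob_dict, window_size), number=1000)
    print(f"{name}: {seconds:.4f} s for 1000 calls")
```

Measuring on strings and window lengths close to the real workload would give a more meaningful comparison than this 17-character example.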
Answer 1 (score: 0)
>>> from functools import reduce
>>> strA='ABCCBAAABBCCABABC'
>>> d = {"A":[0.4, 0.5, 0.1, 0.2], "B": [0.3, 0.5, 0.3, 0.6], "C":[0.3, 0.0, 0.6, 0.2]}
>>> n = 4
>>>
>>> f = lambda t: reduce(lambda a,b: a*b, [d[c][i] for i,c in enumerate(t)])
>>> plist = list(map(f, zip(*[strA[i:] for i in range(n)])))
>>> plist
[0.024, 0.0, 0.0, 0.003, 0.003, 0.012000000000000002, 0.036, 0.012, 0.018, 0.0, 0.0, 0.009, 0.012000000000000002, 0.009]
>>>
>>> max(plist)
0.036
>>>
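The `zip(*[strA[i:] for i in range(n)])` expression in this answer builds the sliding windows by zipping the string with shifted copies of itself. Because `zip` stops at the shortest copy, only full-length windows are produced. Here is a small sketch of just that step, using the same `strA` and `n`; the names `shifted` and `windows` are introduced here for illustration.

```python
strA = "ABCCBAAABBCCABABC"
n = 4

# Shifted copies: strA[0:], strA[1:], strA[2:], strA[3:].
shifted = [strA[i:] for i in range(n)]

# zip pairs up the k-th character of every copy, giving one tuple per window.
windows = ["".join(chars) for chars in zip(*shifted)]

print(len(windows))  # 14, i.e. len(strA) - n + 1
print(windows[:3])   # ['ABCC', 'BCCB', 'CCBA']
```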