Scoring a string based on a dictionary of values

Time: 2018-07-11 21:29:27

Tags: python dictionary

I have a string

strA='ABCCBAAABBCCABABC'

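and a dictionary of lists, each shorter than strA (for example, length 4), that gives the probability of each character occurring at a given position: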

{"A":[0.4, 0.5, 0.1, 0.2], "B": [0.3, 0.5, 0.3, 0.6], "C":[0.3, 0.0, 0.6, 0.2]}

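I now want to slide a window of length 4 along strA and score the whole string against this dictionary. The first 4 characters of strA are ABCC, so according to the dictionary the score is: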

(p(A) at pos 0) * (p(B) at pos 1) * (p(C) at pos 2) * (p(C) at pos 3)

= 0.4 * 0.5 * 0.6 * 0.2 = 0.024
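The arithmetic above can be reproduced with a short sketch. The helper name `window_score` is hypothetical (it does not appear in the original post), and `math.prod` assumes Python 3.8 or later.

```python
import math

# Position-specific probabilities from the question
probs = {"A": [0.4, 0.5, 0.1, 0.2],
         "B": [0.3, 0.5, 0.3, 0.6],
         "C": [0.3, 0.0, 0.6, 0.2]}

def window_score(window, probs):
    """Multiply the probability of each character at its position in the window.

    Hypothetical helper, for illustration only.
    """
    return math.prod(probs[ch][pos] for pos, ch in enumerate(window))

score = window_score("ABCC", probs)  # 0.4 * 0.5 * 0.6 * 0.2
print(score)
```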

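Now I want to slide this window in steps of 1 and find the highest score of any window of this length across the whole string.

This is how I do it at the moment: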

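    def best_score(strA, prob_dict, window_len=4):
        # prob_dict replaces the original name `dict`, which shadows the builtin
        score_new = 0
        # +1 so that the last window of the string is scored as well
        for a in range(len(strA) - window_len + 1):
            slide = strA[a:a + window_len]

            score = 1
            for b in range(len(slide)):
                score = score * prob_dict[slide[b]][b]

            if score > score_new:
                score_new = score

        return score_new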

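It works, but it is slow. I have to score 10,000 strings against 1,000 windows of varying lengths, each with its own probability dictionary.

Is there a faster way to score a string against a probability dictionary and return only the highest score?

2 Answers:

Answer 0: (score: 1)

This is what I came up with: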

def compute_best_score(test_string, prob_dict, window_size):
    # assertion to check for uniform lengths
    assert(all([len(prob_list) == window_size for prob_list in prob_dict.values()]))
    best_window_index = -1
    best_window_score = 0
    for index, letter in enumerate(test_string):
        # ignore last region of test_string as a window cannot be computed
        if index > len(test_string) - window_size:
            break
        current_window_score = 1.0
        for score_index in range(window_size):
            current_char = test_string[index + score_index]
            letter_scores = prob_dict[current_char]
            current_window_score *= letter_scores[score_index]
        if current_window_score > best_window_score:
            best_window_score = current_window_score
            best_window_index = index
    return best_window_score, best_window_index

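In my Python profiling, this approach seemed to perform better than the reduce-based answer (which surprised me). Here is a link to the simple profiling run I did: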

http://tpcg.io/S05WeL
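One possible refinement of the loop above, offered as a sketch rather than a measured improvement: if every value in the dictionary is a probability in [0, 1] (an assumption, although it holds for the example data), the running product can never increase, so a window can be abandoned as soon as it drops to or below the best score found so far. The function name `best_score_with_pruning` is hypothetical.

```python
def best_score_with_pruning(text, probs, window_size):
    """Return (best_score, best_index), abandoning windows that cannot win."""
    best_score, best_index = 0.0, -1
    for start in range(len(text) - window_size + 1):
        score = 1.0
        for offset in range(window_size):
            score *= probs[text[start + offset]][offset]
            # Assumption: all values lie in [0, 1], so the running product
            # never increases; once it is <= the best so far, stop early.
            if score <= best_score:
                break
        else:
            # The inner loop finished without a break, so this window
            # strictly beats the best score seen so far.
            best_score, best_index = score, start
    return best_score, best_index

strA = 'ABCCBAAABBCCABABC'
d = {"A": [0.4, 0.5, 0.1, 0.2],
     "B": [0.3, 0.5, 0.3, 0.6],
     "C": [0.3, 0.0, 0.6, 0.2]}
best, where = best_score_with_pruning(strA, d, 4)
print(best, where, strA[where:where + 4])
```

On the question's data the best window starts at index 6 (AABB). Whether the early exit pays off depends on the data: it helps most when a high-scoring window is found early or when zeros are common.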

Answer 1: (score: 0)

>>> from functools import reduce
>>> strA='ABCCBAAABBCCABABC'
>>> d = {"A":[0.4, 0.5, 0.1, 0.2], "B": [0.3, 0.5, 0.3, 0.6], "C":[0.3, 0.0, 0.6, 0.2]}
>>> n = 4
>>>
>>> f = lambda t: reduce(lambda a,b: a*b, [d[c][i] for i,c in enumerate(t)])
>>> plist = list(map(f, zip(*[strA[i:] for i in range(n)])))
>>> plist
[0.024, 0.0, 0.0, 0.003, 0.003, 0.012000000000000002, 0.036, 0.012, 0.018, 0.0, 0.0, 0.009, 0.012000000000000002, 0.009]
>>> 
>>> max(plist)
0.036
>>>
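For use outside an interactive session, the same reduce-based idea can be wrapped in a function. This is a sketch: the name `window_scores` is hypothetical, and `operator.mul` stands in for the inner lambda. The list comprehension returns a real list, so the scores can be inspected and passed to `max` more than once.

```python
from functools import reduce
from operator import mul

def window_scores(text, probs, n):
    """Score every length-n window of text, left to right."""
    # Zipping n shifted copies of text yields one tuple of characters per window
    windows = zip(*[text[i:] for i in range(n)])
    return [reduce(mul, (probs[c][i] for i, c in enumerate(w))) for w in windows]

strA = 'ABCCBAAABBCCABABC'
d = {"A": [0.4, 0.5, 0.1, 0.2],
     "B": [0.3, 0.5, 0.3, 0.6],
     "C": [0.3, 0.0, 0.6, 0.2]}
scores = window_scores(strA, d, 4)
print(len(scores))                    # one score per window
print(scores.index(max(scores)))      # start index of the best window
```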