Question

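I want to build a list of probability ranges for the characters in a string so that I can apply arithmetic coding to them. Here is an example of what I am trying to achieve (taken from a tutorial/overview):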

a   30%     [0.00, 0.30)
b   15%     [0.30, 0.45)
c   25%     [0.45, 0.70)
d   10%     [0.70, 0.80)
e   20%     [0.80, 1.00)

Expressed in Python, the way my implementation does it, this looks like:

[(0.00, 0.30), (0.30, 0.45), (0.45, 0.70), (0.70, 0.80), (0.80, 1.00)]

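along with an associated list of characters that lines up with this list. The ranges must be unique and must not collide with one another. (Note that the upper bound of each range is really an infinitely long string of 9s, because the upper bound of a given range is itself occupied by the next range in the list.)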

Here is my current implementation:

from decimal import Decimal, getcontext
getcontext().prec = 2

def _nodupsfreq(string):
    """list deduplicator"""
    l, res = [], []
    for ch in string:
        if ch not in l:
            l.append(ch)
            res.append(string.count(ch))
    return l, res

def getprobs(string):
    """return a set of probability ranges for a string"""
    k, v = _nodupsfreq(string)
    rs = [(0, 0)]            # need a 0th element for first iteration (messy)
    x = []                   # construct the keys ensuring they match 
    for i in range(len(v)):
        y = 0 if i == 0 else i - 1 # this is the reason for the 0th element
        lower = rs[y][1]
        upper = Decimal(lower) + Decimal(v[i] / len(string))
        res = (lower, upper)
        rs.append(res)
        x.append(k[i])
    return rs[1:], x  # more messiness because of the first item

def probs_as_dict(string):
    """get a list of probability ranges as a dictionary"""
    l, k = getprobs(string)
    d = {}
    for i in range(len(k)):
        d[k[i]] = (float(l[i][0]), float(l[i][1]))
    return d

m = "BILL GATES"
__import__("pprint").pprint(probs_as_dict(m))

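In theory it does what it says on the tin, but in practice the only way it keeps the ranges "unique" is by basing each iteration's range on the upper bound of the range from the previous iteration, which is clearly fragile, as the output shows: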

{
 ' ': (0.1, 0.2),
 'A': (0.2, 0.3),
 'B': (0.0, 0.1),  # occupied here!
 'E': (0.3, 0.4),  # occupied here!
 'G': (0.3, 0.4),  # junk
 'I': (0.0, 0.1),  # junk
 'L': (0.1, 0.3),
 'S': (0.5, 0.6),
 'T': (0.4, 0.5)
}

Overlapping ranges, some of identical length and some of different lengths.

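Of course, I could keep patching my implementation and get better at eliminating duplicate ranges, or I could choose a better way of generating probability ranges from a string in the first place.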

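Is there a better way to express the probabilities of the characters in a string, or a better way to guarantee unique, non-colliding multi-dimensional elements in a collection? Failing that, how should I fix my code?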

Answer 1

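To compute the length of a character's probability range, you can simply divide the number of occurrences of the character by the total length of the string.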
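
A minimal sketch of that division, assuming Python 3.7+ (where `Counter` keeps characters in first-seen order). The function name `range_lengths` is hypothetical and does not come from the question; `Fraction` is used here only to keep the lengths exact, and plain floats would work for a rough version.

```python
from collections import Counter
from fractions import Fraction

def range_lengths(string):
    """Map each distinct character to the length of its probability range."""
    counts = Counter(string)   # occurrences per character, in first-seen order
    total = len(string)
    # count / total, kept as an exact fraction to avoid rounding error
    return {ch: Fraction(n, total) for ch, n in counts.items()}

lengths = range_lengths("BILL GATES")
print(lengths["L"])            # "L" occurs 2 times out of 10 characters
print(sum(lengths.values()))   # the lengths of all ranges add up to 1
```

Because every character appears exactly once as a key, two characters can never be assigned the same slot by accident.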

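The range lengths obtained this way are enough to describe the ranges, because you know that the first range starts at 0.0 and that each subsequent range starts where the previous one ends.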

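This way there is no need to store the range boundaries explicitly, which eliminates the possibility of collisions. If you need the boundaries in a calculation, you can easily compute them with a function.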

So, in your example, you would only need to store an array of real numbers (here, the lengths as percentages), like this:

[30.0, 15.0, 25.0, 10.0, 20.0]
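
As a rough sketch of how the boundaries could be recovered from such a list of lengths when they are needed: the helper name `boundaries` is hypothetical, and the lengths are assumed to be given as fractions of 1 rather than percentages. Each lower bound is literally the previous upper bound, so adjacent ranges cannot overlap or leave gaps.

```python
from fractions import Fraction
from itertools import accumulate

def boundaries(lengths):
    """Turn range lengths into half-open (lower, upper) pairs."""
    uppers = list(accumulate(lengths))      # running totals are the upper bounds
    lowers = [Fraction(0)] + uppers[:-1]    # each range starts where the previous one ends
    return list(zip(lowers, uppers))

# The percentages from the example table, converted to exact fractions.
lengths = [Fraction(p, 100) for p in (30, 15, 25, 10, 20)]
ranges = boundaries(lengths)
for ch, (lower, upper) in zip("abcde", ranges):
    print(ch, float(lower), float(upper))
```

For the example table this reproduces the ranges [0.00, 0.30), [0.30, 0.45), [0.45, 0.70), [0.70, 0.80) and [0.80, 1.00).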

How do I keep track of unique ranges in a 2D list of ranges?
