Question

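I'm trying to skip over everything that isn't a letter (apostrophes, etc.) and then continue. The digit should end up in the corresponding position in the result. The code is from this accepted answer, and the word list is here.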

The string is "thereare7deadlysins".
The code below outputs "there are 7 d e a d l y s i n s"; I'm trying to get "there are 7 deadly sins".

I tried the following, but got IndexError: 'string index out of range'

# Backtrack to recover the minimal-cost string.
out = []
i = len(s)
while i>0:
    if isinstance(s[i], int):
        continue
    c,k = best_match(i)
    assert c == cost[i]
    out.append(s[i-k:i])
    i -= k

The whole thing is:

from math import log 
import string

# Build a cost dictionary, assuming Zipf's law and cost = -math.log(probability).
words = open("/Users/.../Desktop/wordlist.txt").read().split()
wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))
maxword = max(len(x) for x in words)
table = string.maketrans("","")
l = "".join("thereare7deadlysins".split()).lower()

def infer_spaces(s):
    """Uses dynamic programming to infer the location of spaces in a string
    without spaces."""
    # Find the best match for the i first characters, assuming cost has
    # been built for the i-1 first characters.
    # Returns a pair (match_cost, match_length).
    def best_match(i):
        candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
        return min((c + wordcost.get(s[i-k-1:i], 9e999), k+1) for k,c in candidates)

    # Build the cost array.
    cost = [0]
    for i in range(1,len(s)+1):
        c,k = best_match(i)
        cost.append(c)

    # Backtrack to recover the minimal-cost string.
    out = []
    i = len(s)
    while i>0:
        c,k = best_match(i)
        assert c == cost[i]
        out.append(s[i-k:i])
        i -= k

    return " ".join(reversed(out))

def test_trans(s):
    return s.translate(table, string.punctuation)


s = test_trans(l)
print(infer_spaces(s))

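Edit: Based on the accepted answer, the following changes solved my problem:
1. Removed single letters from the word list (except a, e, i).
2. Added the following below wordcost: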

nums = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
for n in nums:
    wordcost[n] = log(2)
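
To make the effect of these two changes concrete, here is a minimal, self-contained sketch in Python 3. It is not the original script: `WORDS` is a tiny hypothetical stand-in for the real word list (it already contains no single letters other than "a" and "i"), and `build_wordcost` and its `add_digits` flag are names introduced only for this illustration. The dynamic programming itself follows the `infer_spaces` function above.

```python
from math import log

# Hypothetical miniature word list, most frequent first (an assumption for
# this sketch; the real list is far larger).
WORDS = ["the", "a", "i", "are", "there", "here", "dead", "sin", "sins", "deadly"]


def build_wordcost(words, add_digits=True):
    """Zipf-style cost per word, optionally with a cheap cost for each digit."""
    wordcost = {w: log((rank + 1) * log(len(words))) for rank, w in enumerate(words)}
    if add_digits:
        for digit in "0123456789":
            wordcost[digit] = log(2)
    return wordcost


def infer_spaces(s, wordcost):
    """Infer spaces in s by minimising the total cost of the chosen words."""
    maxword = max(len(w) for w in wordcost)

    def best_match(i):
        # Best (cost, length) for a word ending at position i.
        candidates = enumerate(reversed(cost[max(0, i - maxword):i]))
        return min((c + wordcost.get(s[i - k - 1:i], float("inf")), k + 1)
                   for k, c in candidates)

    # Build the cost array.
    cost = [0]
    for i in range(1, len(s) + 1):
        c, k = best_match(i)
        cost.append(c)

    # Backtrack to recover the minimal-cost segmentation.
    out = []
    i = len(s)
    while i > 0:
        c, k = best_match(i)
        out.append(s[i - k:i])
        i -= k
    return " ".join(reversed(out))


print(infer_spaces("thereare7deadlysins", build_wordcost(WORDS)))
# there are 7 deadly sins
```

With `add_digits=False` the same sketch returns "there are 7 d e a d l y s i n s": once "7" has no finite cost, every position after it is unreachable and the backtracking falls back to one character at a time.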

The suggestion to change wordcost to the following did not produce the best results:

wordcost = dict( (k, (i+1)*log(1+len(k))) for i,k in enumerate(words) )

Example:
String: "Recall8importantscreeningquestions"
Original wordcost: "recall 8 important screening questions"
Suggested wordcost: "re call 8 important screening questions"

Answer 1

Note that the word list contains all 26 individual letters as words.

With the following modifications, your algorithm will correctly infer the spaces for the input string "therearesevendeadlysins" (i.e. with "7" changed to "seven"):

1. Remove the single-letter words from the word list (except perhaps "a" and "i").
2. As @Pm 2Ring noted, change the definition of wordcost to:

wordcost = dict( (k, (i+1)*log(1+len(k))) for i,k in enumerate(words) )

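So something about non-letters is throwing your algorithm off. Since you have already removed the punctuation, you should treat a run of non-letters as a single word.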

For example, if you add:

wordcost["7"] = log(2)

(in addition to changes 1 and 2 above), your algorithm works on the original test string.
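
Another way to read "treat a run of non-letters as a single word" is to split the input into letter runs and non-letter runs before segmenting. The sketch below (Python 3) only illustrates that idea and is not part of the fix above; `split_runs` and `segment_with_runs` are hypothetical helper names, and `segment_letters` stands for whatever function segments a letters-only string, such as `infer_spaces`.

```python
import re


def split_runs(s):
    """Split s into alternating runs of lowercase letters and of anything else."""
    return re.findall(r"[a-z]+|[^a-z]+", s)


def segment_with_runs(s, segment_letters):
    """Segment only the letter runs; pass non-letter runs through unchanged."""
    pieces = []
    for match in re.finditer(r"([a-z]+)|([^a-z]+)", s):
        letters, other = match.groups()
        if letters is not None:
            pieces.append(segment_letters(letters))
        else:
            pieces.append(other)
    return " ".join(pieces)


print(split_runs("thereare7deadlysins"))
# ['thereare', '7', 'deadlysins']
```

Done this way, each number keeps its position in the result and the cost dictionary needs no entries for digits, at the price of an extra pass over the string.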

Answer 2

i = len(s) -1

avoids the IndexError: 'string index out of range', and

if s[i].isdigit():

is the test you are looking for.
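
A short sketch of why both points matter (Python 3; `digit_positions` is a hypothetical helper written only for this illustration). The valid indices of a string run from 0 to len(s) - 1, and every element of a string is itself a string, so an `isinstance(..., int)` check can never succeed.

```python
s = "thereare7deadlysins"

# s[len(s)] is one past the end, which is what raises the IndexError.
try:
    s[len(s)]
    raised = False
except IndexError:
    raised = True

# Each character is a str, never an int; isdigit() is the check that works.
is_int = isinstance(s[8], int)   # False
is_digit = s[8].isdigit()        # True, because s[8] is "7"


def digit_positions(s):
    """Walk backwards through s and collect the indices that hold a digit."""
    positions = []
    i = len(s) - 1
    while i >= 0:
        if s[i].isdigit():
            positions.append(i)
        # Always step back, even when skipping a character; a bare
        # `continue` that leaves i unchanged would loop forever.
        i -= 1
    return positions[::-1]


print(digit_positions(s))
# [8]
```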

Infer spaces: ignore numbers and special characters
