Question

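I recently came across a coding task and have been struggling to get it right. It goes like this:

Given a non-empty string s and a list word_list containing non-empty words, determine whether s can be segmented into a space-separated sequence of one or more dictionary words. You may assume that word_list contains no duplicates, but each word can be used more than once.

For example, given: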

s = 'whataniceday'
word_list = ['a', 'what', 'an', 'nice', 'day']

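Return True, because 'whataniceday' can be segmented as 'what a nice day'.

I came up with a pretty naive solution that works for this particular example, but it is not hard to make it fail, for example by adding a word to word_list that another word in the list starts with (i.e. ['a', 'wha', 'what', 'an', 'nice', 'day']). There are plenty of other things that can break my solution, but anyway, here it is: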

s = "whataniceday"
word_list = ["h", "a", "what", "an", "nice", "day"]

def can_be_segmented(s, word_list):
    tested_str = s
    buildup_str = ''

    for letter in tested_str:        
        buildup_str += letter

        if buildup_str not in word_list:
            continue

        tested_str = tested_str[len(buildup_str):]
        buildup_str = ''

    return bool(tested_str == '' and buildup_str == '')

print(can_be_segmented(s, word_list))

Do you have any ideas on how to fix it? Or maybe there is a better approach to this problem?

Answer 1

>>> import re
>>> s = 'whataniceday'
>>> word_list = ['a', 'what', 'an', 'nice', 'day']
>>> re.match('^(' + '|'.join(f'({s})' for s in word_list) + ')*$', s)
<_sre.SRE_Match object; span=(0, 12), match='whataniceday'>

As a function:

import re
def can_be_segmented(s, word_list):
    pattern = re.compile('^(' + '|'.join(f'({s})' for s in word_list) + ')*$')
    return pattern.match(s) is not None

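It might be an optimization to make the groups non-capturing ((?:word) instead of (word)), so that re.match does not have to keep track of the matched words, but I am not going to time it.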
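A minimal sketch of that non-capturing variant, not timed here either. The name can_be_segmented_nc is only illustrative and does not appear in the answer:

```python
import re

def can_be_segmented_nc(s, word_list):
    # (?:...) groups the alternatives without recording what each one matched
    alternatives = '|'.join(f'(?:{word})' for word in word_list)
    pattern = re.compile('^(?:' + alternatives + ')*$')
    return pattern.match(s) is not None

print(can_be_segmented_nc('whataniceday', ['a', 'what', 'an', 'nice', 'day']))
```

The regex engine backtracks, so a list such as ['a', 'wha', 'what', 'an', 'nice', 'day'] is handled as well.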

If your words are not just letters, you might want to pass them through re.escape() (as in f'({re.escape(s)})' instead of f'({s})').
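A short sketch of the escaped version, assuming a made-up word list that contains regex metacharacters; can_be_segmented_escaped is a hypothetical name:

```python
import re

def can_be_segmented_escaped(s, word_list):
    # re.escape makes characters such as '.' or '+' match literally
    alternatives = '|'.join(f'({re.escape(word)})' for word in word_list)
    pattern = re.compile('^(' + alternatives + ')*$')
    return pattern.match(s) is not None

words = ['c++', 'is', 'a.b']
print(can_be_segmented_escaped('c++isa.b', words))  # True
print(can_be_segmented_escaped('c++isaxb', words))  # False: '.' no longer matches 'x'
```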

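If you have mixed case and want it to match regardless, pass the re.IGNORECASE or re.I flag when compiling the pattern (as in re.compile(regex, re.I), where regex is the pattern string built above). A compiled pattern's match() method does not take flags; its second positional argument is a start position.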
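A sketch of the case-insensitive variant, assuming the flag is supplied to re.compile; the function name can_be_segmented_ci is illustrative:

```python
import re

def can_be_segmented_ci(s, word_list):
    alternatives = '|'.join(f'({re.escape(word)})' for word in word_list)
    # The flag is given at compile time and applies to every later match
    pattern = re.compile('^(' + alternatives + ')*$', re.IGNORECASE)
    return pattern.match(s) is not None

print(can_be_segmented_ci('WhatANiceDay', ['a', 'what', 'an', 'nice', 'day']))
```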

See the re documentation for more information.

Answer 2

Here is my solution, using recursion with a generator expression for brevity:

s = "whataniceday"
word_list = ["h", "ani", "a", "what", "an", "nice", "day"]

def can_be_segmented(s, word_list):
    return s == "" or any(
        s.startswith(word) and can_be_segmented(s[len(word):], word_list)
        for word in word_list)

assert can_be_segmented(s, word_list)
assert not can_be_segmented("whataniannicday", word_list)

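This code states that the string can be segmented if it is empty, or if we can find a word such that the string starts with that word and the rest of the string can itself be segmented.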
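Plain recursion of this kind can examine the same suffix many times, for instance with a long run of 'a' followed by 'b' and the words 'a' and 'aa'. One common remedy is to cache the result per start index. The sketch below is a swapped-in variant rather than the answer's code; can_be_segmented_memo and from_index are illustrative names, and it assumes, as the task states, that no word is empty:

```python
from functools import lru_cache

def can_be_segmented_memo(s, word_list):
    words = tuple(word_list)

    @lru_cache(maxsize=None)
    def from_index(start):
        # True if s[start:] can be split into words from word_list
        if start == len(s):
            return True
        return any(
            s.startswith(word, start) and from_index(start + len(word))
            for word in words
        )

    return from_index(0)

word_list = ["h", "ani", "a", "what", "an", "nice", "day"]
assert can_be_segmented_memo("whataniceday", word_list)
assert not can_be_segmented_memo("whataniannicday", word_list)
```

Each start index is evaluated at most once, so the work is bounded by the length of s times the number of words, ignoring the cost of the startswith comparisons. Very long strings could still hit Python's recursion limit.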

Answer 3

def can_be_segmented(s, word_list):

    # try every word in word_list
    for word in word_list:

        # if s is equal to a word, then success
        if s == word:
            return True

        # otherwise if s starts with a word, call ourselves recursively
        # with the remainder of s
        elif s.startswith(word):
            if can_be_segmented(s[len(word):], word_list):
                return True

    # we tried every possibility, failure
    return False

Answer 4

Explained in the comments:

def contains(text, pattern):
    # try every start position, including the last one where pattern still fits
    for i in range(len(text) - len(pattern) + 1):
        found = True

        for j in range(len(pattern)):
            if text[i + j] != pattern[j]:  # comparing each letter
                found = False
                break

        if found:
            return True

    return False


   
s = 'hatanicda'
word_list = ['a', 'what', 'an', 'nice', 'day']
match = []

for i in word_list:
    if contains(s, i) and len(i) > 3: # 3 since word has to be more than is/are/the to be meaningful 
        match.append(i)

print(bool(match))

False


Check whether a string can be split into a sentence using words from a provided list
