Question

Due to a technical problem, all spaces in every sentence were removed (except the ones after full stops).

mystring='thisisonlyatest. andhereisanothersentense'

Is there any way in Python to get readable output like this...

"this is only a test. and here is another sentense."

Answer 1

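If you have a list of valid common words (these can be found online for various languages), you can take every prefix of the text, check whether it is a valid word, and then repeat recursively on the rest of the sentence. Memoization prevents redundant computation for identical suffixes.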

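Here is an example in Python. The lru_cache decorator adds memoization to the function, so the splits for each suffix are computed only once, no matter how the first part was split. Note that words is a set for O(1) lookup. A prefix tree would also work well.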

import functools

words = {"this", "his", "is", "only", "a", "at", "ate", "test",
         "and", "here", "her", "is", "an", "other", "another",
         "sent", "sentense", "tense", "and", "thousands", "more"}
max_len = max(map(len, words))  # no prefix longer than this can be a word

@functools.lru_cache(None)  # memoize the result for each distinct suffix
def find_sentences(text):
    # Returns a tuple of word lists instead of yielding them: a cached
    # generator would already be exhausted the second time it is looked up.
    if len(text) == 0:
        return ([],)
    sentences = []
    for i in range(1, min(max_len, len(text)) + 1):
        prefix, suffix = text[:i], text[i:]
        if prefix in words:
            for rest in find_sentences(suffix):
                sentences.append([prefix] + rest)
    return tuple(sentences)

mystring = 'thisisonlyatest. andhereisanothersentense'
for text in mystring.split(". "):
    print(repr(text))
    for sentence in find_sentences(text):
        print(sentence)

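This gives you a list of valid (but possibly nonsensical) ways to split each sentence into words. There may be few enough of them that you can pick the right one by hand. Otherwise, you may have to add another post-processing step, for example part-of-speech analysis with a proper NLP framework.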
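
As a rough stand-in for that post-processing step, the candidate splits can be ranked with a simple heuristic before anyone looks at them. The sketch below prefers splits with fewer words and, among equals, the one containing the longest word. `rank_splits` is a hypothetical helper, and the heuristic is an assumption for illustration; it is not the part-of-speech analysis mentioned above.

```python
def rank_splits(candidates):
    # Heuristic (an assumption, not real NLP): fewer words first;
    # among equals, prefer the split that contains the longest word.
    return sorted(candidates,
                  key=lambda ws: (len(ws), -max(map(len, ws), default=0)))


# The two splits the word list above allows for the second sentence.
candidates = [
    ["and", "here", "is", "an", "other", "sentense"],
    ["and", "here", "is", "another", "sentense"],
]
best = rank_splits(candidates)[0]
print(" ".join(best))  # and here is another sentense
```

This only helps when the wrong splits tend to chop real words into smaller ones; for anything subtler, a real language model or tagger is probably needed.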

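
The prefix tree mentioned as an alternative to the set could look roughly like this. It is a minimal sketch built from nested dicts; `build_trie`, `prefixes_in_trie` and the `END` marker are hypothetical names, not part of the answer's code. Compared with slicing and set lookups, the walk stops as soon as no word can start with the characters seen so far.

```python
END = None  # marker key for "a word ends here"; never clashes with a character


def build_trie(words):
    # Each node is a dict mapping a character to its child node.
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def prefixes_in_trie(trie, text):
    # Walk the trie along text and collect every prefix that is a complete word.
    found = []
    node = trie
    for i, ch in enumerate(text, start=1):
        node = node.get(ch)
        if node is None:
            break  # no word starts with text[:i], so stop early
        if END in node:
            found.append(text[:i])
    return found


trie = build_trie({"a", "at", "ate", "test", "an", "and", "another"})
print(prefixes_in_trie(trie, "atest"))            # ['a', 'at', 'ate']
print(prefixes_in_trie(trie, "anothersentense"))  # ['a', 'an', 'another']
```

In the recursive function, the loop over `range(...)` with `prefix in words` could then be replaced by a loop over `prefixes_in_trie(trie, text)`.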