Question

I have a string on a single line:

"specificationsinaccordancewithqualityaccreditedstandards"

I need to split it into tokenized words, like this:

"specifications in accordance with quality accredited standards"

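I tried nltk's word_tokenize, but it could not split the string.

Context: I am parsing PDF documents into text files, and this is the text I get back from the PDF converter. To convert the PDF to text I am using PDFminer in {{1}}.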

Answer 1

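You can use a trie. A trie is a data structure that lets you validate words.
It is a tree whose branches you follow to find valid prefixes, and it tells you when you have reached a complete word.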
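
To make the idea concrete, here is a minimal pure-Python sketch of a trie. The names `TrieNode`, `Trie`, `insert` and `prefixes` are made up for this illustration and are not the datrie API; the word list is a toy one.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # next character -> child node
        self.is_word = False  # True when the path from the root spells a full word


class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def prefixes(self, text):
        """Return every stored word that is a prefix of text, shortest first."""
        found = []
        node = self.root
        for i, ch in enumerate(text):
            node = node.children.get(ch)
            if node is None:      # no branch for this character: stop walking
                break
            if node.is_word:      # reached a complete word
                found.append(text[:i + 1])
        return found


trie = Trie(["in", "inn", "with", "quality"])
print(trie.prefixes("innkeeper"))  # ['in', 'inn']
```

Walking one character at a time means a lookup costs at most the length of the text, regardless of how many words are stored.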

Although I have never used it specifically, I found this Python implementation, datrie.

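My idea is to import datrie, use it to build a trie from a dictionary txt file (for example, here), and then parse the string. Read the string character by character for as long as you keep finding matches in the trie; when you can no longer plausibly extend the current word, add it to the split string.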
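
The scan described above could look roughly like the sketch below. It uses a nested-dict trie as a stand-in for datrie, and `build_trie`, `split_words` and the seven-word list are assumptions made only for this example. At each position it keeps the longest word it can reach, which is a greedy choice and can go wrong with a real dictionary.

```python
END = object()  # sentinel key meaning "a word ends at this node"


def build_trie(words):
    """Build a nested-dict trie: each node maps a character to a child node."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def split_words(text, trie):
    """Greedy split: at each position take the longest word found in the trie."""
    pieces = []
    pos = 0
    while pos < len(text):
        node = trie
        longest_end = None
        for i in range(pos, len(text)):
            node = node.get(text[i])
            if node is None:        # no branch for this character
                break
            if END in node:         # a complete word ends here
                longest_end = i + 1
        if longest_end is None:
            # no word starts here: keep the rest unsplit and stop
            pieces.append(text[pos:])
            break
        pieces.append(text[pos:longest_end])
        pos = longest_end
    return pieces


words = ["specifications", "in", "accordance", "with",
         "quality", "accredited", "standards"]
trie = build_trie(words)
text = "specificationsinaccordancewithqualityaccreditedstandards"
print(" ".join(split_words(text, trie)))
# specifications in accordance with quality accredited standards
```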

You can find more information about tries here on Wikipedia or in this video (from the person who taught me what a trie is).

Answer 2

You can solve this with recursion. First, you need to download a dictionary txt file, which you can get here: https://github.com/Ajax12345/My-Python-Projects/blob/master/the_file.txt

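# Load the word list, one word per line; skip blank lines, because an
# empty string is a prefix of everything and would never consume input.
with open('the_file.txt') as f:
    dictionary = [line.strip() for line in f if line.strip()]

def get_options(scrambled, flag, totals, last):
    # flag is set once no dictionary word matches the remaining text
    if flag:
        return totals

    # every dictionary word that is a prefix of the remaining text
    new_list = [i for i in dictionary if scrambled.startswith(i)]
    if new_list:
        # greedy choice: take the longest matching prefix
        possible_word = max(new_list, key=len)
        new_totals = totals + [possible_word]
        new_scrambled = scrambled[len(possible_word):]
        return get_options(new_scrambled, False, new_totals, possible_word)

    # nothing matches (or the string is used up): stop recursing
    return get_options("", True, totals, '')


s = "specificationsinaccordancewithqualityaccreditedstandards"
print(' '.join(get_options(s, False, [], '')))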

Output:

'specifications in accordance with quality accredited standards'
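
The recursive approach above commits to one matching word at each step, so it can hit a dead end when that word leaves a remainder that cannot be split. As a hedged alternative, not part of the original answer, the sketch below tries longer candidates first but backtracks to shorter ones when needed, with memoization on the start index. `segment` and the toy word set are names invented for this example.

```python
from functools import lru_cache


def segment(text, words):
    """Return one split of text into words from `words`, or None if none exists."""
    words = frozenset(words)
    max_len = max(map(len, words), default=0)

    @lru_cache(maxsize=None)
    def solve(start):
        if start == len(text):
            return ()
        # try longer candidates first, fall back to shorter ones
        for end in range(min(len(text), start + max_len), start, -1):
            if text[start:end] in words:
                rest = solve(end)
                if rest is not None:
                    return (text[start:end],) + rest
        return None  # no word sequence covers text[start:]

    result = solve(0)
    return None if result is None else list(result)


toy_words = {"the", "theme", "meeting", "me"}
print(segment("themeeting", toy_words))  # ['the', 'meeting']
```

Here the longest first match is "theme", which leaves "eting" with no valid split, so the search falls back to "the". For very long inputs the recursion depth grows with the number of words, so an iterative version may be preferable.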

How to convert a line of text into meaningful words
