Question

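I'm using NLTK to process some text extracted from PDF files. I can recover the text mostly intact, but there are many instances where the spaces between words were not captured, so I get ifI instead of if I, thatposition instead of that position, or andhe's instead of and he's.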

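My question is this: how can I use NLTK to find words it does not recognize or has not learned, and check whether a "nearby" combination of words is much more likely? Is there a more elegant way to do this check than walking through each unrecognized word one character at a time, splitting it, and seeing whether the two halves are both recognizable words?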

Answer 1

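I would suggest considering pyenchant instead, since it is a more robust solution for this kind of problem. Once you have downloaded and installed pyenchant, here is an example of how to get the result you want: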

>>> text = "IfI am inthat position, Idon't think I will."  # note the lack of spaces
>>> from enchant.checker import SpellChecker
>>> checker = SpellChecker("en_US")
>>> checker.set_text(text)
>>> for error in checker:
...     for suggestion in error.suggest():
...         # accept a suggestion only if it has exactly the same characters
...         # as the misspelled word, in the same order, ignoring spaces
...         if error.word.replace(' ', '') == suggestion.replace(' ', ''):
...             error.replace(suggestion)
...             break
...
>>> checker.get_text()
"If I am in that position, I don't think I will."  # text is now fixed
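
For comparison, the brute-force check described in the question (split an unrecognized word at each position and see whether both halves are known) can be sketched in plain Python. This is only an illustration and is not part of the pyenchant approach above. KNOWN_WORDS is a hypothetical toy vocabulary, and is_known, split_fused, and repair are names made up for this sketch; a real vocabulary would have to come from a proper word list.

```python
# Hypothetical toy vocabulary, just large enough for the example sentence.
# In practice this would be replaced by a real word list.
KNOWN_WORDS = {
    "if", "i", "am", "in", "that", "position",
    "and", "he's", "don't", "think", "will",
}


def is_known(word):
    """Case-insensitive lookup in the toy vocabulary."""
    return word.lower() in KNOWN_WORDS


def split_fused(token):
    """Return (left, right) for the first split point where both halves
    are known words, or None if the token is known or cannot be split."""
    if is_known(token):
        return None
    for i in range(1, len(token)):
        left, right = token[:i], token[i:]
        if is_known(left) and is_known(right):
            return left, right
    return None


def repair(text):
    """Re-insert a missing space into each whitespace-separated token
    that can be split into two known words."""
    fixed = []
    for token in text.split():
        core = token.rstrip(".,;:!?")   # set trailing punctuation aside
        tail = token[len(core):]
        parts = split_fused(core)
        fixed.append(" ".join(parts) + tail if parts else token)
    return " ".join(fixed)


print(repair("IfI am inthat position, Idon't think I will."))
# prints: If I am in that position, I don't think I will.
```

This sketch only handles two words fused together and takes the first valid split it finds, so it can pick a wrong split when several are possible. That limitation is one reason a spell checker that ranks its suggestions is the more robust choice.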

Tokenizing unsplit words from OCR using NLTK
