Question

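I'm using NLTK to analyze a corpus that has been OCRed. I'm new to NLTK. Most of the OCR is good, but occasionally I come across lines that are clearly junk. For instance: oomfi ow Ba wmnondmam BE wBwHo<oBoBm. Bowman as: Ham: 8 ooww om $5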

I'd like to identify (and filter out) such lines from my analysis.

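How do NLP practitioners handle this situation? Something like: if 70% of the words in a sentence are not in WordNet, discard it. Or, if NLTK can't identify the part of speech for 80% of the words, discard it? What algorithms work for this? Is there a "gold standard" way to do this?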

Answer 1

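Using n-grams is probably your best option. You can use Google n-grams, or you can use the n-grams built into NLTK. The idea is to build a language model and see what probability it assigns to any given sentence. You can define a probability threshold and discard every sentence that scores below it. Any reasonable language model will give a very low score to the example sentence.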

If you think some of the words may be only slightly corrupted, you can try spelling correction before testing with n-grams.
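As a rough illustration of that pre-correction step, here is a minimal sketch that snaps near-miss tokens to a known vocabulary using the standard library's `difflib`. It is a stand-in for a real spelling corrector, not a recommendation of one: the `VOCAB` list, the function names, and the 0.8 cutoff are all assumptions made up for this example.

```python
import difflib

# Hypothetical toy vocabulary; in practice this would come from a word list
# or from the corpus the language model is trained on.
VOCAB = ["this", "is", "a", "standard", "english", "sentence", "the", "of"]

def correct_token(token, vocab=VOCAB, cutoff=0.8):
    """Return the closest vocabulary word, or the token unchanged if none is close."""
    lowered = token.lower()
    if lowered in vocab:
        return lowered
    # get_close_matches ranks candidates by SequenceMatcher similarity ratio.
    matches = difflib.get_close_matches(lowered, vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else token

def correct_line(line):
    """Correct each whitespace-separated token independently."""
    return " ".join(correct_token(tok) for tok in line.split())

print(correct_line("This is a standarcl Eng1ish sentense"))
print(correct_token("wmnondmam"))  # nothing is close, so it is left alone
```

Tokens that are beyond repair pass through unchanged, so a truly garbled line should still score badly under the language model afterwards.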

EDIT: here is some sample NLTK code for scoring sentences with a bigram model:

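# Note: NgramModel is the NLTK 2.x API. It was removed in NLTK 3;
# newer releases provide the nltk.lm package instead.
from nltk import NgramModel
from nltk.corpus import brown
from nltk.util import ngrams
from nltk.probability import LidstoneProbDist

n = 2
# Lidstone smoothing (gamma = 0.2) so unseen bigrams get a non-zero probability.
est = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)
lm = NgramModel(n, brown.words(categories='news'), estimator=est)

def sentenceprob(sentence):
    # Tokens are scored as written (case-sensitive, like the Brown training text).
    bigrams = ngrams(sentence.split(), n)
    tot = 0
    for grams in bigrams:
        # logprob returns the negative log probability of the last word
        # given the preceding context, so the total acts as a cost.
        score = lm.logprob(grams[-1], grams[:-1])
        tot += score
    return tot

sentence1 = "This is a standard English sentence"
sentence2 = "oomfi ow Ba wmnondmam BE wBwHo<oBoBm. Bowman as: Ham: 8 ooww om $5"

print(sentenceprob(sentence1))
print(sentenceprob(sentence2))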

The results look like this:

$ python lmtest.py
  42.7436688972
  158.850086668

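Lower is better, since each score is a sum of negative log probabilities. (Of course, you can play with the parameters.)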
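To turn such scores into an actual filter, one option is to normalize by sentence length and compare against a threshold. The sketch below is a dependency-free approximation of the same idea, not the NLTK code above: it trains bigram counts on a tiny made-up corpus, applies Lidstone smoothing by hand, and averages the negative log2 probability per bigram. The function names, the toy corpus, and the threshold of 2.5 are all assumptions for illustration; a real threshold would need tuning on your own data.

```python
import math
from collections import Counter

def train_bigram_counts(tokens):
    """Count bigrams and their left-hand contexts over a list of training tokens."""
    contexts = Counter(tokens[:-1])
    bigrams = Counter(zip(tokens, tokens[1:]))
    vocab_size = len(set(tokens)) + 1  # +1 bin reserved for unseen words
    return contexts, bigrams, vocab_size

def avg_neg_logprob(line, model, gamma=0.2):
    """Average negative log2 probability per bigram (lower means more plausible)."""
    contexts, bigrams, vocab_size = model
    tokens = line.lower().split()
    pairs = list(zip(tokens, tokens[1:]))
    if not pairs:
        return float("inf")  # fewer than two tokens: nothing to score
    total = 0.0
    for prev, word in pairs:
        # Lidstone (add-gamma) smoothing keeps unseen bigrams above zero.
        prob = (bigrams[(prev, word)] + gamma) / (contexts[prev] + gamma * vocab_size)
        total -= math.log2(prob)
    return total / len(pairs)

def keep_line(line, model, threshold):
    """Keep a line only if its average cost is at or below the threshold."""
    return avg_neg_logprob(line, model) <= threshold

# Hypothetical toy training text; a real model would use a large corpus.
corpus = ("this is a standard english sentence . this is another english sentence . "
          "a sentence is a standard unit of english text .").split()
model = train_bigram_counts(corpus)

clean = "This is a standard English sentence"
junk = "oomfi ow Ba wmnondmam BE wBwHo<oBoBm. Bowman as: Ham: 8 ooww om $5"
for line in (clean, junk):
    print(keep_line(line, model, threshold=2.5), line)
```

Averaging per bigram is a design choice rather than a requirement: a raw sum grows with sentence length, so a single cut-off on the sum would penalize long but perfectly readable lines.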

How to check for unreadable OCRed text with NLTK
