How to remove OCR artifacts from text?

Time: 2015-06-30 11:12:26

Tags: algorithm ocr text-processing information-retrieval text-segmentation

OCR-generated text sometimes comes with artifacts, such as this one:

  

Diese grundsätzliche V e r b o r g e n h e i t Gottes, die sich nur dem Nachfolger öffnet, ist m i t d e m Messiasgeheimnis gemeint

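It is not uncommon for spacing between letters to be used as emphasis (probably because of the limitations of early printing presses), but it is harmful for retrieval tasks.

How can one turn the above text into a more canonical form, such as: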

  

Diese grundsätzliche Verborgenheit Gottes, die sich nur dem Nachfolger öffnet, ist mit dem Messiasgeheimnis gemeint

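Can this be done efficiently for large amounts of text?

One idea would be to concatenate the whole string (to skip guessing where the word boundaries are) and then run a text segmentation algorithm on it, perhaps something similar to this: http://norvig.com/ngrams/

1 Answer:

Answer 0: (score: 2)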

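If you have a dictionary for the target language, and every spaced-out run consists of just a single word, then it's easy: scan through the text looking for maximal-length runs of spaced-out single letters, and replace each run with the single corresponding dictionary word if one exists (otherwise leave it unchanged).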
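To make the single-word case concrete, here is a minimal Python sketch. The names `DICTIONARY`, `SPACED_RUN` and `collapse_spaced_words` are made up for illustration, the word list is a toy, and the threshold of at least three letters per run is an assumption rather than part of the approach described above.

```python
import re

# Hypothetical toy dictionary; a real one would be loaded from a word list.
DICTIONARY = {"verborgenheit", "nachfolger", "mit", "dem"}

# A maximal run of at least three single letters separated by single spaces,
# e.g. "V e r b o r g e n h e i t". [^\W\d_] matches any Unicode letter.
SPACED_RUN = re.compile(r"(?<!\S)[^\W\d_](?: [^\W\d_]){2,}(?!\S)")

def collapse_spaced_words(text, dictionary=DICTIONARY):
    """Join spaced-out runs that form exactly one dictionary word."""
    def join_if_word(match):
        candidate = match.group(0).replace(" ", "")
        # Only join when the result is a known word; otherwise leave the run alone.
        return candidate if candidate.lower() in dictionary else match.group(0)
    return SPACED_RUN.sub(join_if_word, text)
```

A run such as `m i t d e m` is left untouched by this sketch, because `mitdem` is not a single dictionary word; that is the case the rest of the answer deals with.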

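The only real difficulty is with strings like m i t d e m that correspond to two or more separate words. A simple approach would be to greedily "nibble off" prefixes that appear in the dictionary, but this can lead to suboptimal results, in particular to a leftover suffix that doesn't correspond to any dictionary string even though a different choice of breakpoints would have worked (e.g. b e i m A r z t won't work if you greedily grab bei from the front instead of beim). Fortunately there's a simple linear-time DP approach that does a better job (linear as long as the length of candidate words is bounded by a constant), and it can even incorporate weights on words, which helps to pick the most likely decomposition when there is more than one. Given a string S[1 .. n] (with spaces removed), we compute f(i), the score of the best decomposition of the length-i prefix of S, for all 1 <= i <= n: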

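f(0) = 0
f(i) = max over all 0 <= j < i of f(j) + dictScore(S[j+1 .. i])

f(n) will then be the score of the best decomposition of the entire string. If you set dictScore(T) to 1 for words that exist in the dictionary and 0 for words that don't, you'll get a decomposition into as many words as possible; if you set dictScore(T) to, e.g., -1 for words that exist in the dictionary and -2 for words that don't, you'll get a decomposition into as few words as possible. You can also choose to award higher scores to more "likely" words.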

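Having computed these scores, you can then walk back through the DP matrix to reconstruct a decomposition that corresponds to the maximum score.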
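The recurrence and the walk back can be sketched in Python roughly as follows. This is only an illustration under some assumptions: `segment`, `score`, `WORDS` and `max_word_len` are hypothetical names, indices are 0-based (so `letters[j:i]` plays the role of S[j+1 .. i]), and the scoring values (1 for a dictionary word, -10 for anything else) are chosen here so that the toy examples have a unique best split; they are not the values suggested above.

```python
import math

# Hypothetical toy dictionary and scoring function (illustrative values only).
WORDS = {"bei", "beim", "arzt", "mit", "dem"}

def score(word):
    # Reward dictionary words, heavily penalise everything else.
    return 1.0 if word in WORDS else -10.0

def segment(letters, dict_score, max_word_len=30):
    """Best-scoring split of `letters` (spaces already removed).

    f[i] is the best score for the prefix of length i:
    f[i] = max over j < i of f[j] + dict_score(letters[j:i]).
    """
    n = len(letters)
    f = [0.0] + [-math.inf] * n
    back = [0] * (n + 1)  # back[i] = start index of the last word in the best split
    for i in range(1, n + 1):
        # Assumption: no word is longer than max_word_len, which bounds the inner loop.
        for j in range(max(0, i - max_word_len), i):
            candidate = f[j] + dict_score(letters[j:i])
            if candidate > f[i]:
                f[i], back[i] = candidate, j
    # Walk the back-pointers from the end to recover the words.
    words, i = [], n
    while i > 0:
        words.append(letters[back[i]:i])
        i = back[i]
    return words[::-1], f[n]
```

With this toy setup, `segment("beimarzt", score)` splits into `beim` and `arzt` rather than getting stuck after `bei`, and `segment("mitdem", score)` gives `mit` and `dem`. Swapping in a different `dict_score`, for example one based on word frequencies, changes which decomposition wins without touching the DP itself.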