How can I separate incorrectly joined words in OCRed text?

Time: 2019-05-09 17:20:47

Tags: awk sed ocr text-processing

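I have the text of a long document that was OCRed by someone else. It contains a lot of instances where the spacing wasn't recognized properly and two words are run together (e.g. divisionbetween, hasalready, everyoneelse). Is there a relatively quick way, using awk, sed, or the like, to find strings that are not words and check whether they can be separated into legitimate words?

Or is there some other quick way to fix them? For instance, I notice that Chrome is able to flag the combined words as misspellings, and when you right-click, the suggested correction is pretty much always the one I want, but I don't know a quick way to just auto-fix them all (and there are thousands).

Thanks!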

1 answer:

Answer 0: (score: 1)

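Trying to do this with command-line tools is likely to be error-prone, but if you have a dictionary of words you could do something like the following, using GNU awk for patsplit() and for a multi-character RS in case any of your files have DOS line endings: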

$ cat words
bar
disco
discontent
exchange
experts
foo
is
now
of
tent
winter

$ cat file
now is the freezing winter
of ExPeRtSeXcHaNgE discontent

$ cat tst.awk
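BEGIN {
    RS = "\r?\n"                    # multi-char RS (GNU awk): accept Unix and DOS line endings
    minSubLgth = 2                  # shortest substring a word may be split into
    minWordLgth = minSubLgth * 2    # shortest alphabetic string worth checking
}
# First file: load the dictionary, lowercased, as array indices.
NR==FNR {
    realWords[tolower($0)]
    next
}
# Second file: check every alphabetic string of at least minWordLgth letters.
{
    n = patsplit($0,words,"[[:alpha:]]{"minWordLgth",}+",seps)
    printf "%s", seps[0]
    for (i=1; i<=n; i++) {
        word = words[i]
        lcword = tolower(word)
        if ( !(lcword in realWords) ) {
            found = 0
            # Try every split point, longest head first, and stop at the first hit.
            for (j=length(lcword)-minSubLgth; j>=minSubLgth; j--) {
                head = substr(lcword,1,j)
                tail = substr(lcword,j+1)
                if ( (head in realWords) && (tail in realWords) ) {
                    found = 1
                    break
                }
            }
            # [[[...]]] marks a split; <<<...>>> marks an unknown word that could not be split.
            word = (found ? "[[[" substr(word,1,j) " " substr(word,j+1) "]]]" : "<<<" word ">>>")
        }
        printf "%s%s", word, seps[i]
    }
    print ""
}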

$ awk -f tst.awk words file
now is the <<<freezing>>> winter
of [[[ExPeRtS eXcHaNgE]]] discontent

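The above identifies alphabetic strings that are not in the word list (compared case-insensitively), then iteratively creates pairs of substrings from each of them and checks whether both substrings are in "realWords[]". It will be somewhat slow and approximate, and it only works when two words are run together, not three or more, but maybe it's good enough. Think about the algorithm, since it may or may not be the best way to split up the strings (I didn't put much thought into it). Adjust it so it doesn't look at words shorter than some number of letters (I used 4 above) and doesn't split into substrings shorter than some other number of letters (I used 2 above). You may or may not want to highlight words that don't appear in realWords[] but can't be split either ("freezing" above).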
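Regarding the two-word limitation, here is a minimal sketch of one way to allow three or more words. It is written in Python rather than awk, is not part of the script above, and the names `segment`, `real_words` and `min_sub_len` are made up for illustration. It uses a simple dynamic-programming pass that looks for the split with the fewest dictionary words, which is only one possible heuristic.

```python
def segment(text, real_words, min_sub_len=2):
    """Split text into the fewest words found in real_words, or return None.

    real_words is assumed to be a set of lowercase words; min_sub_len plays
    the same role as minSubLgth in the awk script.
    """
    lc = text.lower()
    n = len(lc)
    # best[i] is the shortest list of cut positions covering lc[:i],
    # or None if lc[:i] cannot be covered by dictionary words.
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(0, end - min_sub_len + 1):
            if best[start] is None or lc[start:end] not in real_words:
                continue
            candidate = best[start] + [end]
            if best[end] is None or len(candidate) < len(best[end]):
                best[end] = candidate
    if best[n] is None:
        return None
    # Cut the original (mixed-case) text at the positions that were found.
    pieces, prev = [], 0
    for cut in best[n]:
        pieces.append(text[prev:cut])
        prev = cut
    return pieces


if __name__ == "__main__":
    words = {"separated", "into", "legitimate", "experts", "exchange"}
    print(segment("separatedintolegitimate", words))
    print(segment("ExPeRtSeXcHaNgE", words))
    print(segment("freezing", words))
```

With that toy word set, the first call splits the string into three words, the second keeps the original letter case, and the third returns None because no split exists. With a real word list, short dictionary entries can still produce wrong splits, so the output would need the same kind of review as the awk output.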

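FWIW I downloaded the word list from https://github.com/dwyl/english-words/blob/master/words_alpha.txt (you might want to google for a better list, since this one seems to contain some non-words, e.g. wasnll). Using a version of the text in your question with some additional spaces removed, you can see some things it catches, some it can't handle, and some it gets wrong: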

$ cat file
I have the textof a long document that was OCRed by someoneelse that contains
a lot ofinstances where the spacingwasn't recognized properly and two words
are run together (ex: divisionbetween, hasalready, everyoneelse). Is there a
relatively quickway using awk, sed, or the like tofind strings that are not
words andcheck if they can separatedintolegitimate words?

Or is there someother quick way to fix them? Forinstance, Inotice that
Chrome is able toflag the combined words asmisspellings and when you right
click, thesuggested correction is pretty much always the oneIwant, but I
don't know a quickway to just auto-fix themall(and there are thousands).

$ awk -f tst.awk words_alpha.txt file
I have the [[[text of]]] a long document that was [[[OC Red]]] by [[[someone else]]] that contains
a lot [[[of instances]]] where the [[[spacing wasn]]]'t recognized properly and two words
are run together (ex: [[[division between]]], [[[has already]]], [[[everyone else]]]). Is there a
relatively [[[quick way]]] using awk, sed, or the like [[[to find]]] strings that are not
words [[[and check]]] if they can <<<separatedintolegitimate>>> words?

Or is there [[[some other]]] quick way to fix them? [[[For instance]]], [[[Ino tice]]] that
Chrome is able [[[to flag]]] the combined words [[[as misspellings]]] and when you right
click, [[[the suggested]]] correction is pretty much always the <<<oneIwant>>>, but I
don't know a [[[quick way]]] to just auto-fix [[[thema ll]]](and there are thousands).

FWIW that took about half a second to run on my [underpowered] laptop running cygwin.
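Some of the wrong splits in that output, such as "thema ll", happen because the loop stops at the first match, which is always the one with the longest head. A hedged sketch of an alternative, again in Python and with the hypothetical names `two_way_splits`, `real_words` and `min_sub_len`, is to collect every valid two-way split so a person (or some ranking rule of your choosing) can pick among them:

```python
def two_way_splits(text, real_words, min_sub_len=2):
    """Return every (head, tail) split of text where both parts are in real_words.

    Split points are tried longest head first, the same order as the awk loop,
    but nothing stops at the first hit. real_words is assumed to be a set of
    lowercase words.
    """
    lc = text.lower()
    splits = []
    for j in range(len(lc) - min_sub_len, min_sub_len - 1, -1):
        if lc[:j] in real_words and lc[j:] in real_words:
            splits.append((text[:j], text[j:]))
    return splits


if __name__ == "__main__":
    # Toy word set; "thema" and "ll" stand in for the odd entries in a big list.
    words = {"thema", "ll", "them", "all"}
    print(two_way_splits("themall", words))
```

This does not help with cases like "Ino tice", where the intended split "I notice" needs a one-letter head and is therefore excluded by the minimum substring length.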