Question

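I have OCR-scanned a large number of documents and need to identify keywords in the scanned files. The problem is that OCR is unreliable: for example, the word "SUBSCRIPTION" may end up as "SUBSCR|P||ON". So I need to search for a near match rather than an exact match.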

Does anyone know how to search a file for the word "SUBSCRIPTION" and return true if an 80% match is found?

Answer 1

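Take a look at the Amatch gem. It implements several distance algorithms. Also read the other answer about the difference between the Levenshtein and Jaro distance algorithms, and check which one suits you better.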

TL;DR: here is a small snippet to help you get started with the Amatch gem for your problem.

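require 'amatch' # gem install amatch

# Scores are rounded. In these results levenshtein_similar is sensitive to
# case, while the Jaro variants score both spellings the same.
'subscription'.levenshtein_similar('SUBSCR|P||ON') #=> 0.0
'SUBSCRIPTION'.levenshtein_similar('SUBSCR|P||ON') #=> 0.75
'subscription'.jaro_similar('SUBSCR|P||ON')        #=> 0.83
'SUBSCRIPTION'.jaro_similar('SUBSCR|P||ON')        #=> 0.83
'subscription'.jarowinkler_similar('SUBSCR|P||ON') #=> 0.9
'SUBSCRIPTION'.jarowinkler_similar('SUBSCR|P||ON') #=> 0.9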
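
To see where a score like 0.75 comes from, here is a minimal plain-Ruby sketch of a normalized Levenshtein similarity. It is not Amatch's implementation; the method names are made up for illustration, and it assumes the similarity is computed as 1 - distance / (length of the longer string), which is consistent with the 0.0 and 0.75 results above.

```ruby
# Edit distance: the minimum number of single-character insertions,
# deletions, and substitutions needed to turn a into b.
def levenshtein_distance(a, b)
  # previous_row[j] is the distance between the processed prefix of a and b[0, j].
  previous_row = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    current_row = [i]
    b.each_char.with_index(1) do |char_b, j|
      substitution = previous_row[j - 1] + (char_a == char_b ? 0 : 1)
      deletion     = previous_row[j] + 1
      insertion    = current_row[j - 1] + 1
      current_row << [substitution, deletion, insertion].min
    end
    previous_row = current_row
  end
  previous_row.last
end

# Assumed normalization: 1.0 means identical, 0.0 means nothing in common.
def levenshtein_similarity(a, b)
  longest = [a.length, b.length].max
  return 1.0 if longest.zero?

  1.0 - levenshtein_distance(a, b).to_f / longest
end

puts levenshtein_similarity('SUBSCRIPTION', 'SUBSCR|P||ON') # 3 substitutions out of 12 characters
puts levenshtein_similarity('subscription', 'SUBSCR|P||ON') # every character differs
```

The OCR output differs from the real word in three positions (I, T, I became |), so the distance is 3 and the score is 1 - 3/12. With lowercase input no character matches at all, which is why case matters for this measure.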

If you want to check whether a given word occurs anywhere in a text, try the following:

def occurs?(text, target_word)
  # Split the text on whitespace and report whether any word is a close
  # enough match (Jaro similarity above 0.8) to the target word.
  text.split.any? { |word| word.jaro_similar(target_word) > 0.8 }
end

example_text = 'This text has the word SUBSCR|P||ON malformed.'
other_text = 'This text does not.'

occurs?(example_text, 'SUBSCRIPTION') #=> true
occurs?(other_text, 'SUBSCRIPTION')   #=> false

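Note that you can also call #downcase on the words of the text if you like. You will first have to parse the text content of the original files. Hope this helps!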
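
As a rough sketch of that last step, the following reads a text file, normalizes the case of each word, and checks for a near match. Everything here is hypothetical scaffolding: `normalized_words`, `file_mentions?`, and `POSITIONAL_SIMILARITY` are names invented for this example, and the scorer is a simple stand-in so the code runs without any gem. With Amatch you would pass something like `->(word, target) { word.jaro_similar(target) }` as the scorer instead.

```ruby
require 'tempfile'

# Stand-in scorer (an assumption for this sketch, not an Amatch algorithm):
# the share of positions at which both strings hold the same character.
POSITIONAL_SIMILARITY = lambda do |a, b|
  longest = [a.length, b.length].max
  return 0.0 if longest.zero?

  a.chars.zip(b.chars).count { |x, y| x == y }.to_f / longest
end

# Reads the file line by line and returns its words, upcased and with
# leading/trailing punctuation stripped ("malformed." becomes "MALFORMED").
def normalized_words(path)
  File.foreach(path).flat_map do |line|
    line.split.map { |word| word.upcase.gsub(/\A[.,;:!?"'()]+|[.,;:!?"'()]+\z/, '') }
  end.reject(&:empty?)
end

# True if any word in the file scores at least `threshold` against the target.
def file_mentions?(path, target, threshold: 0.8, scorer: POSITIONAL_SIMILARITY)
  wanted = target.upcase
  normalized_words(path).any? { |word| scorer.call(word, wanted) >= threshold }
end

Tempfile.create('ocr') do |file|
  file.write("This text has the word Subscr|p||on, malformed.\n")
  file.flush
  puts file_mentions?(file.path, 'subscription', threshold: 0.7)
end
```

The stand-in scorer is stricter than Jaro (it gives "SUBSCR|P||ON" 9 matching positions out of 12), so the demo lowers the threshold to 0.7. Which threshold works in practice depends on the scorer you plug in and on how noisy the OCR output is.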

Ruby - Searching a file for similar words
