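I have images containing two digits, and to get good accuracy I want to remove the right-hand digit before recognition. Example) Original, Modified
The images are PNG files (52 × 26 px) with background color (192, 192, 192, 255), and each digit has a different color.
Surprisingly, once the right-hand digit has been removed, tesseract fails to recognize the remaining digit.
Result: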
original:60
left:
Answer 0 (score: 0)
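Tesseract performs connected component analysis internally. It tries to group text into blocks, and the problem may be that the page contains too few characters for this to work. There are page segmentation modes, and with one of them you can ask tesseract to treat the image as a single character. Try that approach; it may give you the result you need.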
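As a rough sketch of that suggestion, the snippet below passes page segmentation mode 10 ("treat the image as a single character") through the `pytesseract` wrapper. The use of `pytesseract` and Pillow, the helper names `build_config` and `recognize_single_digit`, the file name `left.png`, and the digit whitelist are all assumptions for illustration and do not come from the question; the whitelist in particular may be ignored by some Tesseract versions.

```python
def build_config(psm=10, whitelist="0123456789"):
    """Build a Tesseract config string (hypothetical helper).

    --psm 10 asks Tesseract to treat the image as a single character.
    tessedit_char_whitelist restricts the output to digits; whether it is
    honored depends on the Tesseract version and OCR engine in use.
    """
    return f"--psm {psm} -c tessedit_char_whitelist={whitelist}"


def recognize_single_digit(path):
    """Run OCR on an image that contains exactly one digit."""
    # Imported lazily so build_config() can be used without these packages.
    from PIL import Image
    import pytesseract

    image = Image.open(path)
    text = pytesseract.image_to_string(image, config=build_config())
    return text.strip()


if __name__ == "__main__":
    # "left.png" is a placeholder for the image with the right digit removed.
    print(recognize_single_digit("left.png"))
```

If single-character mode still returns nothing, trying other modes (for example `build_config(psm=8)`, which treats the image as a single word) is a cheap next experiment.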