Question

I am trying to recognize the text in a captcha, but I can't get it to work. I am using Python 3, OpenCV and Tesseract.

The simplified code is:

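import cv2
import pytesseract

img_path = "path"  # path to the captcha image

# Load the image, upscale it 2x, and convert it to grayscale
img = cv2.imread(img_path)
img = cv2.resize(img, None, fx=2, fy=2, interpolation=cv2.INTER_LINEAR)
img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Run OCR and print the recognized text
print(pytesseract.image_to_string(img))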

I think I should first remove the colored lines so that only the text remains, and maybe adjust the brightness and contrast. Which filters could I apply?

These are some images to recognize.

Answer 1

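To recognize captcha text with pytesseract, you can do the following.

Prepare a custom training set to train your Tesseract instance to recognize the specific font [optional].
The captcha image needs some preprocessing (for example, *apply a black-and-white filter > scale (up) > blur > morphological transformation + adaptive threshold*) to enhance the text and reduce noise/lines.
To remove the lines: in the sample images only the text is black and none of the lines are, so you can easily turn every non-black pixel white using PIL or OpenCV, or even use a dedicated algorithm such as the Hough Line Transform to detect and remove the lines.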

You can learn about all of these filters and algorithms from the official documentation and tutorials on the OpenCV website.

Tesseract can not recognize captcha text
