Question

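I am trying to extract the numbers from the image strip given below.

I have no problem extracting digits from normal text, but the numbers in the strip above appear to be an image within an image. This is the code I am using to extract the digits.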

import pytesseract
from PIL import Image

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
img = Image.open(r"C:\Users\UserName\PycharmProjects\COLLEGE PROJ\65.png")
text = pytesseract.image_to_string(img, config='--psm 6')
# Save the OCR output; the with-block closes the file automatically
with open("c.txt", 'w') as file:
    file.write(text)
print(text)

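I tried every possible psm (1 through 13), and all of them output only WEEK. The code works if I crop the image down to just the number, but my project requires extracting it from test strips like this one. Can anyone help me? I have been stuck on this part of the project for a while.

I have attached the full image in case it helps anyone understand the problem better.

I can extract the numbers in the text on the right, but I cannot extract them from the WEEK strip on the far left!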

Answer 1

First, you need to apply adaptive thresholding to the image, followed by a bitwise-not operation.

After adaptive thresholding:

After bitwise-not:

To learn more about these operations, see Morphological Transformations, Arithmetic Operations, and Image Thresholding.
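As a side note on how the two operations interact, here is a minimal pure-Python sketch. The helper names are hypothetical and it uses a fixed threshold on a single row of pixels rather than OpenCV's local (adaptive) threshold, but the per-pixel rule is the same: an inverted binary threshold followed by bitwise-not should give the same 8-bit result as a plain binary threshold.

```python
# Hypothetical stand-ins for the per-pixel rules, not OpenCV calls.
def threshold_binary(pixels, thresh, maxval=255):
    # maxval where the pixel is brighter than the threshold, else 0
    return [maxval if p > thresh else 0 for p in pixels]

def threshold_binary_inv(pixels, thresh, maxval=255):
    # the inverse rule: 0 where the pixel is brighter, else maxval
    return [0 if p > thresh else maxval for p in pixels]

def bitwise_not_u8(pixels):
    # flip every bit of an 8-bit value (255 - p)
    return [~p & 0xFF for p in pixels]

row = [12, 200, 90, 255, 0]  # made-up grayscale values
inverted_then_flipped = bitwise_not_u8(threshold_binary_inv(row, 127))
direct = threshold_binary(row, 127)
print(inverted_then_flipped == direct)
```

If that holds for your image too, passing cv2.THRESH_BINARY to cv2.adaptiveThreshold may let you skip the separate bitwise-not step; keeping both steps is harmless.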

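Now we need to read column by column.

So, to set up column-by-column reading, we need page segmentation mode 4:

"4: Assume a single column of text of variable sizes." (source)

Now, when we read: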

txt = pytesseract.image_to_string(bnt, config="--psm 4")

Result:

WEEK © +4 hours te complete

5 Software

in the fifth week af this course, we'll learn about tcomputer software. We'll learn about what software actually is and the
.
.
.

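We get a lot of information, but we only need the values 5 and 6.

The logic is: if the current line contains the string WEEK, take the next non-empty line and print it: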

txt = txt.strip().split("\n")
get_nxt_ln = False
for t in txt:
    if t and get_nxt_ln:
        print(t)
        get_nxt_ln = False
    if "WEEK" in t:
        get_nxt_ln = True

Result:

5 Software
: 6 Troubleshooting
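The same logic can be wrapped in a small function so it is easy to check without running OCR. This is only a sketch: the function name is hypothetical, and the sample string is made up to resemble the OCR output above (the second WEEK line in particular is an assumption, since that part of the output was elided).

```python
def lines_after_marker(ocr_text, marker="WEEK"):
    """Return the first non-empty line that follows each line containing marker."""
    found = []
    take_next = False
    for line in ocr_text.strip().split("\n"):
        if line and take_next:
            found.append(line)
            take_next = False
        if marker in line:
            take_next = True
    return found

# Made-up sample text shaped like the OCR result, not real OCR output
sample = (
    "WEEK © +4 hours te complete\n"
    "\n"
    "5 Software\n"
    "\n"
    "in the fifth week ...\n"
    "WEEK\n"
    "\n"
    ": 6 Troubleshooting\n"
)
week_lines = lines_after_marker(sample)
print(week_lines)
```

Note that the match is case-sensitive, so the lowercase "week" inside the description text does not trigger it.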

Now, to get only the integer, we can use a regular expression:

t = re.sub("[^0-9]", "", t)
print(t)

Result:

5
6
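One caveat: re.sub("[^0-9]", "", t) keeps every digit on the line, so a title that itself contains a number would be merged into the week number. If that matters for your strips, a possible alternative is to take only the first run of digits. The helper name below is hypothetical and the sample lines are made up.

```python
import re

def first_int(line):
    """Return the first run of digits in line as an int, or None if there is none."""
    match = re.search(r"\d+", line)
    return int(match.group()) if match else None

print(first_int("5 Software"))
print(first_int(": 6 Troubleshooting"))
# A made-up line with two numbers: only the leading one is kept
print(first_int("10 Week 2 recap"))
```

With the re.sub approach the last line would produce "102" instead.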

Code:

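import re
import cv2
import pytesseract

# Load the image and convert it to grayscale
img = cv2.imread("BWSFU.jpg")
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Inverted adaptive (local mean) threshold: block size 11, constant 2
thr = cv2.adaptiveThreshold(gry, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                            cv2.THRESH_BINARY_INV, 11, 2)
# Flip the thresholded image with bitwise-not
bnt = cv2.bitwise_not(thr)
# OCR with page segmentation mode 4 (single column of variable-size text)
txt = pytesseract.image_to_string(bnt, config="--psm 4")
txt = txt.strip().split("\n")
get_nxt_ln = False
for t in txt:
    # First non-empty line after a WEEK line: keep only its digits
    if t and get_nxt_ln:
        t = re.sub("[^0-9]", "", t)
        print(t)
        get_nxt_ln = False
    if "WEEK" in t:
        get_nxt_ln = True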
