Pytesseract: improving OCR accuracy

Date: 2020-09-28 09:14:52

Tags: python python-3.x ocr tesseract pytesser

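I want to extract text from an image in Python, and I chose pytesseract for this. When I tried extracting text from the image, the results were not satisfactory. I also went through this and implemented all the techniques listed there, but it still does not seem to perform well.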

Image:

enter image description here

Code:

import pytesseract
import cv2
import numpy as np

img = cv2.imread('D:\\wordsimg.png')

img = cv2.resize(img, None, fx=1.2, fy=1.2, interpolation=cv2.INTER_CUBIC)

img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

kernel = np.ones((1,1), np.uint8)
img = cv2.dilate(img, kernel, iterations=1)
img = cv2.erode(img, kernel, iterations=1)

img = cv2.threshold(cv2.medianBlur(img, 3), 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

pytesseract.pytesseract.tesseract_cmd = 'C:\\Program Files\\Tesseract-OCR\\tesseract.exe'
    
txt = pytesseract.image_to_string(img ,lang = 'eng')

txt = txt[:-1]

txt = txt.replace('\n',' ')

print(txt)

Output:

t hose he large form might light another us should took mountai house n story important went own own thought girl over family look some much ask the under why miss point make mile grow do own school was 

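Even a single unwanted space could cost me a lot. I want the results to be 100% accurate. Any help would be appreciated. Thanks!

1 answer:

Answer 0 (score: 1)

I changed the resize factor from 1.2 to 2 and removed all the preprocessing. I got good results with psm 11 and psm 12.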

import pytesseract
import cv2
import numpy as np

img = cv2.imread('wavy.png')

# img = cv2.resize(img, None, fx=1.2, fy=1.2, interpolation=cv2.INTER_CUBIC)
img = cv2.resize(img, None, fx=2, fy=2)

img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

kernel = np.ones((1,1), np.uint8)
# img = cv2.dilate(img, kernel, iterations=1)
# img = cv2.erode(img, kernel, iterations=1)

# img = cv2.threshold(cv2.medianBlur(img, 3), 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

cv2.imwrite('thresh.png', img)

pytesseract.pytesseract.tesseract_cmd = 'C:\\Program Files (x86)\\Tesseract-OCR\\tesseract.exe'

for psm in range(6,13+1):
    config = '--oem 3 --psm %d' % psm
    txt = pytesseract.image_to_string(img, config = config, lang='eng')
    print('psm ', psm, ':',txt)

The line config = '--oem 3 --psm %d' % psm uses the string interpolation (%) operator to replace %d with an integer (psm). I'm not sure exactly what oem does, but I've gotten into the habit of using it. More on psm at the end of this answer.
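As a small illustration of how the config string is built for each mode, here is a minimal sketch that runs without Tesseract installed. The helper name build_config is hypothetical and not part of pytesseract; it only wraps the same % formatting used in the loop.

```python
# Hypothetical helper (not part of pytesseract): builds the config string
# with the same % string interpolation used in the answer's loop.
def build_config(psm, oem=3):
    # Each %d is replaced, in order, by the integers in the tuple.
    return '--oem %d --psm %d' % (oem, psm)

# One config string per page segmentation mode tried in the answer (6..13).
configs = [build_config(psm) for psm in range(6, 13 + 1)]

for cfg in configs:
    print(cfg)

# An f-string produces the same text as the % operator.
same_as_fstring = build_config(11) == f'--oem {3} --psm {11}'
```

Each resulting string can be passed as the config argument of pytesseract.image_to_string, as in the loop above.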

psm 11 : those he large form might light another us should name took mountain story important went own own thought girl over family look some much ask the under why miss point make mile grow do own school was

psm 12 : those he large form might light another us should name took mountain story important went own own thought girl over family look some much ask the under why miss point make mile grow do own school was

psm is short for page segmentation mode. I'm not sure exactly what the different modes do, but you can get a feel for what each code means from its description. You can get the list of modes from Tesseract's documentation.
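For reference, here is a rough sketch of the modes tried in the loop (6 through 13) as a lookup table. The descriptions are paraphrased from what I believe tesseract --help-psm prints on Tesseract 4 and later, so treat the wording as an assumption and check it against your installed version. The names PSM_MODES and describe_psm are made up for this example.

```python
# Paraphrased descriptions of page segmentation modes 6..13.
# Assumption: these match the output of `tesseract --help-psm` on
# Tesseract 4+; verify against your own installation.
PSM_MODES = {
    6: 'Assume a single uniform block of text.',
    7: 'Treat the image as a single text line.',
    8: 'Treat the image as a single word.',
    9: 'Treat the image as a single word in a circle.',
    10: 'Treat the image as a single character.',
    11: 'Sparse text. Find as much text as possible in no particular order.',
    12: 'Sparse text with orientation and script detection (OSD).',
    13: 'Raw line. Treat the image as a single text line, '
        'bypassing Tesseract-specific hacks.',
}

def describe_psm(psm):
    # Hypothetical helper: returns the description, or a fallback string
    # for modes that are not in the table.
    return PSM_MODES.get(psm, 'unknown mode')

for psm in range(6, 13 + 1):
    print('psm', psm, ':', describe_psm(psm))
```

This would be consistent with modes 11 and 12 working well here, since the image is a scattering of isolated words rather than a paragraph.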