How to extract specific text from payslip images using pytesseract

Time: 2019-11-15 04:45:57

Tags: python-3.x deep-learning computer-vision ocr tesseract

I am new to Tesseract OCR. I have a batch of payslip images and want to extract the date from each one automatically. How can I do this?

As a first step, I tried to extract the text from a single payslip, but it raises an error:

import cv2
import pytesseract
from PIL import Image

img = cv2.imread(r'E:/Receipts/Receipts/0a0ebd53.jpeg')
pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files/Tesseract-OCR/tesseract.exe'
TESSDATA_PREFIX = 'C:/Program Files/Tesseract-OCR/tessdata'
print(pytesseract.image_to_string(img))
# Or convert to a PIL image explicitly beforehand
print(pytesseract.image_to_string(Image.fromarray(img)))

Error:

    200         }
    201 
--> 202         run_tesseract(**kwargs)
    203         filename = kwargs['output_filename_base'] + os.extsep + extension
    204         with open(filename, 'rb') as output_file:

~\Anaconda3\lib\site-packages\pytesseract\pytesseract.py in run_tesseract(input_filename, output_filename_base, extension, lang, config, nice)
    176 
    177     if status_code:
--> 178         raise TesseractError(status_code, get_errors(error_string))
    179 
    180     return True

TesseractError: (1, 'Error opening data file C:\\Program Files (x86)\\Tesseract-OCR\\eng.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language \'eng\' Tesseract couldn\'t load any languages! Could not initialize tesseract.')

Please help me resolve this error, and please also suggest deep learning models for this task.

1 answer:

Answer 0 (score: 0)

Read the image with the PIL library, then pass the image object to image_to_string(img_obj), as shown below.

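from PIL import Image
import pytesseract

# Path to the Tesseract executable (adjust to your installation)
pytesseract.pytesseract.tesseract_cmd = r"C:/Program Files/Tesseract-OCR/tesseract.exe"
image_path = r"E:/Receipts/Receipts/0a0ebd53.jpeg"  # image from the question
image_obj = Image.open(image_path)
print(pytesseract.image_to_string(image_obj))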
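
Reading the image with PIL does not by itself address the TesseractError in the question, which says that eng.traineddata could not be found. The likely cause is that TESSDATA_PREFIX was assigned as an ordinary Python variable, which the tesseract process never sees; it has to be set in the process environment. The sketch below shows one way to do that and then pull date-like strings out of the OCR text with a regular expression. The helper names (point_tesseract_at_tessdata, find_dates, dates_from_image) and the date formats in DATE_PATTERN are assumptions made for illustration, not part of pytesseract, and the pattern will need adjusting to the formats that actually appear on your payslips.

```python
import os
import re

# Assumed date formats: 15/11/2019, 15-11-19, 15.11.2019, 2019-11-15.
# Extend this pattern for the formats that appear on your payslips.
DATE_PATTERN = re.compile(
    r"\b(?:\d{1,2}[/.-]\d{1,2}[/.-]\d{2,4}|\d{4}-\d{1,2}-\d{1,2})\b"
)


def point_tesseract_at_tessdata(tessdata_dir):
    """Export TESSDATA_PREFIX so the tesseract subprocess can see it.

    Returns the path where eng.traineddata is expected, which is handy
    for checking that the file really exists before running OCR.
    """
    os.environ["TESSDATA_PREFIX"] = tessdata_dir
    return os.path.join(tessdata_dir, "eng.traineddata")


def find_dates(text):
    """Return every substring of text that looks like a date."""
    return DATE_PATTERN.findall(text)


def dates_from_image(image_path, tessdata_dir):
    """Run OCR on one image and return the date-like strings found."""
    # Imported here so the helpers above work without the OCR stack installed.
    from PIL import Image
    import pytesseract

    point_tesseract_at_tessdata(tessdata_dir)
    text = pytesseract.image_to_string(Image.open(image_path))
    return find_dates(text)
```

With pytesseract.pytesseract.tesseract_cmd set as in the answer above, a call such as dates_from_image(r'E:/Receipts/Receipts/0a0ebd53.jpeg', r'C:/Program Files/Tesseract-OCR/tessdata') should return a list of candidate dates. Whether TESSDATA_PREFIX should point at the tessdata folder itself or at its parent depends on the Tesseract version, so check that eng.traineddata exists at the returned path. The pytesseract README also documents passing config=r'--tessdata-dir "<your tessdata dir>"' to image_to_string as an alternative to the environment variable.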