Duplication error in pytesseract: 1111 instead of 111, or 8004 instead of 804

Time: 2019-09-14 04:54:03

Tags: ocr python-tesseract

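I'm currently trying to convert a PDF form, which is a table containing printed text. I'm getting a "duplication" error: I call Google's Tesseract through the pytesseract module, and the output contains characters that are repeated when they should not be. For example, I get 1111 when the original is 111, and 8004 when the original is 804.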
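
To make the error easier to measure, the recognized value can be compared with a hand-checked value. The following is a minimal sketch that assumes a few ground-truth values are available; `extra_characters` is a hypothetical helper name, and it uses only `difflib` from the standard library to list the characters the OCR output contains beyond the expected text.

```python
import difflib


def extra_characters(expected, recognized):
    """List (position, text) for characters present only in `recognized`.

    `position` is the index in `recognized` where the extra text starts.
    """
    matcher = difflib.SequenceMatcher(None, expected, recognized)
    extras = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "insert" means the recognized string has text the expected one lacks
        if tag == "insert":
            extras.append((j1, recognized[j1:j2]))
    return extras


if __name__ == "__main__":
    for expected, recognized in [("111", "1111"), ("804", "8004")]:
        print(expected, recognized, extra_characters(expected, recognized))
```

Counting these insertions over a small sample of cells gives a number to compare when a dpi, threshold, or config setting is changed.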

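I tried varying the dpi in the convert_from_path call from the pdf2image module. That didn't seem to help (I went all the way up to 900 dpi). I tried varying some of the custom_oem_psm_config settings. That did improve my general output file, but it didn't reduce these errors. Using OpenCV, I tried applying a threshold to the image (though I'm not 100% sure I applied it correctly). Finally, I tried cropping the image to the exact location of the text, rather than using bounding boxes, and then converting it.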

None of these things helped.
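
For reference, the dpi and config experiments described above can be run side by side. This is a rough sketch rather than the exact code I used: `build_tesseract_config` and `ocr_with_settings` are hypothetical helper names, and the dpi and psm values are arbitrary assumptions. It relies on `convert_from_path` from pdf2image and `image_to_string` from pytesseract, and it builds the config string with a space between the flags.

```python
def build_tesseract_config(oem=3, psm=6):
    """Return a Tesseract config string with space-separated flags."""
    return f"--oem {oem} --psm {psm}"


def ocr_with_settings(pdf_path, dpi_values=(200, 300, 400), psm_values=(4, 6, 11)):
    """OCR every page of a PDF once per (dpi, psm) combination.

    Returns a dict mapping (dpi, psm) to a list of page texts.
    """
    # Imported lazily so build_tesseract_config works without these packages
    from pdf2image import convert_from_path
    import pytesseract

    results = {}
    for dpi in dpi_values:
        # Render the PDF pages to PIL images at the requested resolution
        pages = convert_from_path(pdf_path, dpi=dpi)
        for psm in psm_values:
            config = build_tesseract_config(psm=psm)
            results[(dpi, psm)] = [
                pytesseract.image_to_string(page, config=config) for page in pages
            ]
    return results
```

Keeping the results keyed by setting makes it possible to diff the outputs and see whether a duplicated digit appears under every setting or only under some.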

It was suggested that I try to get a higher-resolution original image or, if that doesn't solve the problem, train my own model.

Note that, although it is long, I have pasted only part of my code.


'''
Part #2 - Recognizing text from the images using OCR
'''

# Imports used by this part
import cv2
import numpy as np
import pytesseract
from PIL import Image

# Variable to get a count of total number of pages
filelimit = image_counter - 1

# Creating a text file to write the output
outfile = "out_text.txt"

# Open the file in append mode so that
# All contents of all images are added to the same file
f = open(outfile, "a+")

# Config for pytesseract accuracy (the flags must be separated by a space)
custom_oem_psm_config = r'--oem 3 --psm 6'


# Iterate from 1 to total number of pages
for i in range(1, filelimit + 1):
    # Set filename to recognize text from
    # Again, these files will be:
    # page_1.jpg
    # page_2.jpg
    # ....
    # page_n.jpg
    filename = "page_" + str(i) + ".jpg"

    # Pre-process the image: convert to grayscale, then binarize
    img = Image.open(filename).convert("L")
    open_cv_image = np.array(img)
    umat_image = cv2.UMat(open_cv_image)
    threshold = 125
    # cv2.threshold takes (src, thresh, maxval, type)
    retval, img = cv2.threshold(umat_image, threshold, 255, cv2.THRESH_BINARY)
    img = Image.fromarray(img.get())
    img.save(filename)


    # Recognize the text as a string in the image using pytesseract
    text = pytesseract.image_to_string(Image.open(filename), config=custom_oem_psm_config)

    # The recognized text is stored in the variable text
    # Any string processing may be applied on text
    # Here, basic formatting has been done:
    # In many PDFs, at line end, if a word can't
    # be written fully, a 'hyphen' is added.
    # The rest of the word is written in the next line
    # Eg: This is a sample text this word here GeeksF-
    # orGeeks is half on the first line, remaining on next.
    # To remove this, we replace every '-\n' to ''.
    text = text.replace('-\n', '')

    # Finally, write the processed text to the file.
    f.write(text)

# Look back at the start of a file
f.seek(0)
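
One way to narrow down where the extra digits come from is to look at per-word confidences instead of the plain string. The sketch below assumes `pytesseract.image_to_data` with `output_type=pytesseract.Output.DICT`, which returns parallel lists under keys such as "text" and "conf". The helper names and the cutoff of 60 are my own assumptions, and a low confidence does not guarantee that a word contains a duplicated character.

```python
def suspicious_words(data, min_conf=60.0):
    """Return (text, confidence) pairs whose confidence is below min_conf.

    `data` is a dict with parallel lists under "text" and "conf", as returned
    by pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT).
    """
    flagged = []
    for text, conf in zip(data["text"], data["conf"]):
        conf = float(conf)  # depending on the version, conf may be str, int, or float
        # Entries with conf -1 are layout boxes, not recognized words
        if text.strip() and 0 <= conf < min_conf:
            flagged.append((text, conf))
    return flagged


def check_page(filename, config="--oem 3 --psm 6"):
    """Run OCR on one image file and return its low-confidence words."""
    # Imported lazily so suspicious_words can be used on saved data alone
    import pytesseract
    from PIL import Image

    data = pytesseract.image_to_data(
        Image.open(filename),
        config=config,
        output_type=pytesseract.Output.DICT,
    )
    return suspicious_words(data)
```

Calling `check_page("page_1.jpg")` on one of the files from the loop above would list the words Tesseract was least sure about, which can then be checked against the PDF by eye.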

If anyone has tips I haven't tried yet, or can tell me the name of this kind of error so that I can research it further, I'd appreciate it.

0 answers:

No answers