Question

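I am currently trying to convert a PDF form that is laid out as a table and contains printed text. I call the Google Tesseract module (through pytesseract), and the output shows what I would describe as a "duplication" error: characters are repeated where they should not be. For example, I get 1111 where the original reads 111, or 8004 where the original reads 804.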

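I tried adjusting the dpi in the convert_from_path call of the pdf2image module. That did not seem to help (I went as high as 900 dpi). I tried adjusting some of the custom_oem_psm_config settings. That did improve my general output file, but it did not reduce these errors at all. With OpenCV, I tried applying some thresholding to the image (though I am not 100% sure I applied it correctly). Finally, I tried cropping the image to the exact location of the text, instead of using a bounding box, and then converting.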
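On the cropping attempt: Tesseract's guidance on improving output quality notes that text cropped with no border around it can be recognized poorly, and that a small border may help. A minimal sketch of padding a crop box before cropping, assuming the box is given as (left, top, right, bottom) in pixels. The helper name pad_box and the 10 px default are hypothetical choices for illustration and are not part of my script:

```python
def pad_box(box, image_size, margin=10):
    """Grow a (left, top, right, bottom) box by `margin` pixels, clamped to the image."""
    left, top, right, bottom = box
    width, height = image_size
    return (
        max(0, left - margin),
        max(0, top - margin),
        min(width, right + margin),
        min(height, bottom + margin),
    )


if __name__ == "__main__":
    # Hypothetical cell location on a 1000 x 800 page image
    print(pad_box((5, 100, 300, 140), (1000, 800)))
```

The padded tuple has the same layout that PIL's Image.crop expects, so it could be passed to it directly.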

None of these things helped.

It has been suggested that I try to obtain a higher-resolution original image or, if that does not solve the problem, train my own model.

Note that I have pasted only part of the code, although it is still long.


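'''
Part #2 - Recognizing text from the images using OCR
'''

# Imports used by this excerpt (image_counter comes from Part #1, not shown)
import cv2
import numpy as np
import pytesseract
from PIL import Image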

# Variable to get a count of total number of pages
filelimit = image_counter - 1

# Creating a text file to write the output
outfile = "out_text.txt"

# Open the file in append mode so that
# All contents of all images are added to the same file
f = open(outfile, "a+")

# Config for pytesseract accuracy (flags must be separated by a space)
custom_oem_psm_config = r'--oem 3 --psm 6'


# Iterate from 1 to total number of pages
for i in range(1, filelimit + 1):
    # Set filename to recognize text from
    # Again, these files will be:
    # page_1.jpg
    # page_2.jpg
    # ....
    # page_n.jpg
    filename = "page_" + str(i) + ".jpg"

    # Pre-process image: apply a fixed binary threshold before OCR
    img = Image.open(filename)
    open_cv_image = np.array(img)
    umat_image = cv2.UMat(open_cv_image)
    threshold = 125
    # cv2.threshold takes (src, thresh, maxval, type)
    retval, img = cv2.threshold(umat_image, threshold, 255, cv2.THRESH_BINARY)
    img = Image.fromarray(img.get())
    img.save(filename)


    # Recognize the text as a string in the image using pytesseract
    text = str(pytesseract.image_to_string(Image.open(filename), config=custom_oem_psm_config))

    # The recognized text is stored in the variable text
    # Any string processing may be applied on text
    # Here, basic formatting has been done:
    # In many PDFs, at line end, if a word can't
    # be written fully, a 'hyphen' is added.
    # The rest of the word is written in the next line
    # Eg: This is a sample text this word here GeeksF-
    # orGeeks is half on the first line, remaining on next.
    # To remove this, we replace every '-\n' to ''.
    text = text.replace('-\n', '')

    # Finally, write the processed text to the file.
    f.write(text)

# Look back at the start of a file
f.seek(0)
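
Since pytesseract passes the config string on to the Tesseract command line, the individual flags need spaces between them. A small sketch of assembling that string, under stated assumptions: build_tesseract_config is a hypothetical helper, and the digit whitelist only makes sense if a field is purely numeric. tessedit_char_whitelist is a Tesseract variable, but whether it is honored with the LSTM engine depends on the Tesseract version:

```python
def build_tesseract_config(oem=3, psm=6, whitelist=None):
    """Return a Tesseract config string with space-separated flags."""
    parts = ["--oem", str(oem), "--psm", str(psm)]
    if whitelist:
        # Restrict recognition to the given characters (assumption: numeric fields)
        parts += ["-c", f"tessedit_char_whitelist={whitelist}"]
    return " ".join(parts)


if __name__ == "__main__":
    print(build_tesseract_config())
    print(build_tesseract_config(whitelist="0123456789"))
```

The returned string could be passed as the config argument of pytesseract.image_to_string in place of a hand-written literal.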

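If anyone has any tips I have not tried yet, or can tell me what this kind of error is called so that I can research it further, I would appreciate it.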
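
One possible way to narrow down where the extra characters come from, sketched under assumptions: pytesseract.image_to_data with output_type=pytesseract.Output.DICT returns per-word text together with a confidence value, so low-confidence tokens can be listed and compared against the page. The helper low_confidence_words and the cutoff of 60 are hypothetical, and the sample dictionary is made-up data in the same shape, not real OCR output:

```python
def low_confidence_words(data, min_conf=60.0):
    """Return (text, conf) pairs whose confidence is below `min_conf`.

    `data` is expected to hold parallel "text" and "conf" lists, as in the
    dictionary returned by pytesseract.image_to_data(..., output_type=Output.DICT).
    """
    flagged = []
    for text, conf in zip(data["text"], data["conf"]):
        conf = float(conf)  # conf may arrive as int, float, or str
        # conf == -1 marks non-word rows (blocks, lines); skip those and blanks
        if conf < 0 or not text.strip():
            continue
        if conf < min_conf:
            flagged.append((text, conf))
    return flagged


if __name__ == "__main__":
    # Made-up sample in the same shape as the real dictionary
    sample = {"text": ["", "111", "8004"], "conf": ["-1", 96, 41.5]}
    print(low_confidence_words(sample))
```

With the real library, data would come from pytesseract.image_to_data(Image.open(filename), config=custom_oem_psm_config, output_type=pytesseract.Output.DICT).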

Duplication error in pytesseract: 1111 instead of 111, or 8004 instead of 804

0 answers: