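I'm currently trying to convert a PDF form, which is laid out as a table and contains printed text. I'm getting what I'd call "duplication" errors: when I call the Google Tesseract module, the output repeats characters that should not be repeated. For example, I get 1111 when the original reads 111, and 8004 when the original reads 804.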
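To put a number on how often this happens, the pattern described above can be checked mechanically. The following is only a minimal sketch: the helper name `is_duplication_error` and the sample pairs are hypothetical, and it assumes the correct value of each field is known for a handful of test cells.

```python
def is_duplication_error(expected: str, actual: str) -> bool:
    """Return True if `actual` is `expected` with exactly one character
    repeated next to itself (for example '804' read as '8004')."""
    if len(actual) != len(expected) + 1:
        return False
    for i, ch in enumerate(expected):
        # Duplicate the character at position i and compare with the OCR output.
        if expected[:i] + ch + expected[i:] == actual:
            return True
    return False


# Hypothetical (expected, OCR output) pairs; the first two are the examples
# from the question, the last two are controls that should not be flagged.
samples = [("111", "1111"), ("804", "8004"), ("804", "8040"), ("123", "123")]
flagged = [(e, a) for e, a in samples if is_duplication_error(e, a)]
print(flagged)  # [('111', '1111'), ('804', '8004')]
```

Counting the flagged pairs before and after each preprocessing change gives a rough measure of whether that change actually reduced the duplication.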
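I tried changing the dpi in the convert_from_path call from the pdf2image module. That didn't seem to help (I went all the way up to 900 dpi). I tried changing some of the custom_oem_psm_config settings. That did improve my output file in general, but it didn't reduce these errors at all. Using OpenCV, I tried applying some thresholding to the image (though I'm not 100% sure I applied it correctly). Finally, I tried cropping the image to the exact location of the text, rather than using bounding boxes, and then converting it.
None of these things helped.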
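Since it is unclear whether the thresholding was applied correctly, it may help to spell out what a binary threshold is expected to do. `cv2.threshold` takes its arguments in the order `(src, thresh, maxval, type)`, and with `cv2.THRESH_BINARY` every pixel strictly greater than `thresh` becomes `maxval` while the rest become 0. The sketch below is a pure-Python stand-in for that rule, not OpenCV's implementation; the function name `binary_threshold` and the tiny patch are made up for illustration.

```python
def binary_threshold(gray, thresh=125, maxval=255):
    """Pure-Python illustration of the THRESH_BINARY rule on a 2-D list of
    0-255 values: pixels strictly greater than `thresh` become `maxval`,
    all other pixels become 0."""
    return [[maxval if px > thresh else 0 for px in row] for row in gray]


# Tiny hypothetical 2x4 grayscale patch.
patch = [
    [10, 125, 126, 240],
    [200, 90, 130, 0],
]
binary = binary_threshold(patch)
print(binary)  # [[0, 0, 255, 255], [255, 0, 255, 0]]
```

If the thresholded page saved to disk does not look like this (only pure black and pure white), the threshold call is probably not doing what was intended.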
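It has been suggested that I try to get a higher-resolution original image or, if that doesn't fix the problem, build my own model.
Note that although it is long, I have only pasted part of my code.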
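'''
Part #2 - Recognizing text from the images using OCR
'''
# Imports used in this part (image_counter is set in Part #1, not shown)
import cv2
import numpy as np
import pytesseract
from PIL import Image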
# Variable to get a count of total number of pages
filelimit = image_counter - 1
# Creating a text file to write the output
outfile = "out_text.txt"
# Open the file in append mode so that
# All contents of all images are added to the same file
f = open(outfile, "a+")
# Config for pytesseract accuracy (the flags must be separated by a space)
custom_oem_psm_config = r'--oem 3 --psm 6'
# Iterate from 1 to total number of pages
for i in range(1, filelimit + 1):
    # Set filename to recognize text from
    # Again, these files will be:
    # page_1.jpg
    # page_2.jpg
    # ....
    # page_n.jpg
    filename = "page_" + str(i) + ".jpg"
    # Pre-process image: binarize it with a fixed threshold
    img = Image.open(filename)
    open_cv_image = np.array(img)
    umat_image = cv2.UMat(open_cv_image)
    threshold = 125
    # cv2.threshold takes (src, thresh, maxval, type)
    retval, img = cv2.threshold(umat_image, threshold, 255, cv2.THRESH_BINARY)
    img = Image.fromarray(img.get())
    img.save(filename)
    # Recognize the text as a string in the image using pytesseract
    text = str(pytesseract.image_to_string(Image.open(filename), config=custom_oem_psm_config))
    # The recognized text is stored in the variable text
    # Any string processing may be applied on text
    # Here, basic formatting has been done:
    # In many PDFs, at line end, if a word can't
    # be written fully, a 'hyphen' is added.
    # The rest of the word is written in the next line
    # Eg: This is a sample text this word here GeeksF-
    # orGeeks is half on the first line, remaining on next.
    # To remove this, we replace every '-\n' to ''.
    text = text.replace('-\n', '')
    # Finally, write the processed text to the file.
    f.write(text)
# Look back at the start of the file
f.seek(0)
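Because the config string is easy to mistype by hand, one option is to build it from its parts and compare a few page segmentation modes on the same page. This is a rough sketch only: `tesseract_config` is a hypothetical helper, and the choice of modes 4, 6 and 11 is just an illustration, not a recommendation.

```python
def tesseract_config(oem: int = 3, psm: int = 6) -> str:
    """Build a Tesseract config string with space-separated flags."""
    return f"--oem {oem} --psm {psm}"


# Page segmentation modes to compare on the same image (illustrative choice).
configs = [tesseract_config(psm=p) for p in (4, 6, 11)]
print(configs)  # ['--oem 3 --psm 4', '--oem 3 --psm 6', '--oem 3 --psm 11']
```

Each string can then be passed as the `config=` argument of `pytesseract.image_to_string`, so the outputs for one page can be compared side by side.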
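If anyone has any tips I haven't tried yet, or can tell me what this kind of error is called so that I can do more research, I would appreciate it.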