Question

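1.) The PDF data cannot be read directly. Why is that?

2.) I saved every page as an image and then used Tesseract to recognize the text.

3.) Because a watermark appears in the background, the text is not recognized correctly.

4.) Remove the watermark (a general solution is needed). The PDF is: https://drive.google.com/open?id=1pXJSdvYoIVfdTog14sOhDUmxAKJTBYWd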

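1.) To read the PDF directly I used PyPDF2 and pdftotext, but both returned an empty list. 2.) I converted each PDF page to an image and passed that image to Tesseract to recognize the text, but the watermark causes problems during recognition.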
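One common reason for extractors returning nothing is that the pages are scanned images with no embedded text layer, although that cannot be confirmed for this file from the information here. A minimal sketch of a check that decides whether the OCR route is needed at all follows; the helper name `needs_ocr` and the `min_chars` cutoff are hypothetical, and the commented wiring assumes a PyPDF2 version that provides `PdfReader`.

```python
def needs_ocr(page_texts, min_chars=20):
    """Return True when the extracted text is too short to be a real text layer.

    page_texts: list of strings (or None), one per page, from a text extractor.
    min_chars: hypothetical cutoff; tune it for your documents.
    """
    total = sum(len((t or "").strip()) for t in page_texts)
    return total < min_chars


# Example wiring with PyPDF2 (assumes a version that provides PdfReader):
#   from PyPDF2 import PdfReader
#   reader = PdfReader("sample.pdf")
#   texts = [page.extract_text() for page in reader.pages]
#   if needs_ocr(texts):
#       ...  # fall back to pdf2image + Tesseract
```

If the check returns True for every file of this kind, converting the pages to images before OCR is probably unavoidable.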

import cv2
import pytesseract
from pdf2image import convert_from_path

# Store all the pages of the PDF in a variable (rendered at 500 DPI)
pages = convert_from_path('sample.pdf', 500)

# Counter used to name the image saved for each page
image_counter = 1

for page in pages:
    filename = "page_" + str(image_counter) + ".jpg"
    # Save the image of the page to disk
    page.save(filename, 'JPEG')
    # Increment the image counter
    image_counter += 1

output_file = open("output.txt", 'a')

for i in range(1, image_counter):
    filename = "page_" + str(i) + ".jpg"
    img = cv2.imread(filename)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Binarize the page before handing it to Tesseract
    gray = cv2.adaptiveThreshold(gray, 255,
                                 cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                 cv2.THRESH_BINARY, 115, 1)
    cv2.imshow("Processed image", gray)
    cv2.waitKey(0)
    cv2.destroyAllWindows()
    text = str(pytesseract.image_to_string(gray))
    print(text)
    output_file.write(text)

output_file.close()
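Adaptive thresholding compares each pixel with its local neighbourhood, so a pale watermark on white paper can still come out black. A possible alternative, sketched under the assumption that the watermark is clearly lighter than the printed text (which cannot be verified from the question), is a single global cutoff. The function name `suppress_light_watermark` and the default `cutoff=150` are hypothetical; the example uses NumPy only and a tiny synthetic array rather than the real pages.

```python
import numpy as np


def suppress_light_watermark(gray, cutoff=150):
    """Turn every pixel brighter than `cutoff` white and every other pixel black.

    gray: 2-D uint8 array, e.g. the output of cv2.cvtColor(img, cv2.COLOR_BGR2GRAY).
    cutoff: hypothetical value; it must lie between the text and watermark intensities.
    """
    gray = np.asarray(gray, dtype=np.uint8)
    return np.where(gray > cutoff, 255, 0).astype(np.uint8)


# Tiny synthetic page: 0 = dark text, 200 = pale watermark, 255 = paper
page = np.array([[0, 200, 255],
                 [200, 0, 255]], dtype=np.uint8)
cleaned = suppress_light_watermark(page)
print(cleaned)
```

With OpenCV, `cv2.threshold(gray, cutoff, 255, cv2.THRESH_BINARY)` applies the same rule. This will not help when the watermark is as dark as the text, so it is not the general solution asked for.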

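I need to output the country, address, postal code, and the mortgagee's bank name; more generally, I need to output specific details from the text that was read. So the problems I am facing are:

1.) The PDF cannot be read directly; it has to be converted to images first and then fed to Tesseract to recognize the text.

2.) The watermark prevents the text from being read correctly.

3.) What is the best way to pull specific fields out of the document? Should I use regular expressions, look for headings throughout the document and take the details that follow them, or use some other method?

Please help!!!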
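To make the regular-expression option concrete, here is a minimal sketch that looks for `Label: value` lines in the OCR text. The labels in `FIELD_LABELS`, the helper `extract_fields`, and the sample text are all hypothetical, since the real headings in the document are not shown here, and real OCR output is usually noisier than this.

```python
import re

# Hypothetical labels; replace them with the headings used in the real document.
FIELD_LABELS = ["Country", "Address", "Postal Code", "Mortgagee Bank"]


def extract_fields(text, labels=FIELD_LABELS):
    """Return {label: value} for every 'Label: value' line found in the text."""
    fields = {}
    for label in labels:
        # Tolerate extra whitespace between the words of a label.
        pattern = r"\s+".join(re.escape(word) for word in label.split())
        # Accept ':' or '-' as the separator and capture the rest of the line.
        match = re.search(rf"^[ \t]*{pattern}[ \t]*[:\-][ \t]*(.+)$", text,
                          flags=re.IGNORECASE | re.MULTILINE)
        if match:
            fields[label] = match.group(1).strip()
    return fields


sample = """Country : India
ADDRESS - 12 Example Road
Postal  Code: 560001
"""
result = extract_fields(sample)
print(result)
```

Labels that are not found are simply left out of the result, which makes it easy to see which fields the OCR step missed.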

How can I remove a watermark from the background of a PDF using Python?

0 answers: