Question

I need to use pytesseract to transcribe a multi-page image.tif to text. I have the following code:

> from PIL import Image
> import pytesseract
> pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'
> print(pytesseract.image_to_string(Image.open('CAMARA.tif'), lang="spa"))

The problem is that only the first page is extracted. How can I extract all of them?

Answer 1

I was able to solve the same problem by calling the convert() method, as shown below:

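from PIL import Image
import pytesseract
# imagePath is the path to the image file, e.g. the TIFF from the question
image = Image.open(imagePath).convert("RGBA")
text = pytesseract.image_to_string(image)
print(text)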

Answer 2

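I assume you mentioned only a single image, "camara.tif". First, you have to convert all the PDF pages to images; see the link for how to do that.

Then loop over the images one by one with pytesseract to extract the text from each image.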
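
For a multi-page TIFF like the one in the question, the loop can run over the frames of the file itself. Below is a minimal sketch, assuming Pillow and pytesseract are installed; the function name `ocr_all_pages` and the optional `ocr` parameter are made up for illustration and are not part of either library.

```python
from PIL import Image, ImageSequence


def ocr_all_pages(tiff_path, lang="spa", ocr=None):
    """Return a list with the recognized text of each page (frame) of a TIFF."""
    if ocr is None:
        # Imported lazily so the page loop can be tried without Tesseract installed.
        import pytesseract
        ocr = pytesseract.image_to_string

    texts = []
    with Image.open(tiff_path) as img:
        # ImageSequence.Iterator yields every frame of a multi-frame image in order.
        for frame in ImageSequence.Iterator(img):
            # convert() returns an independent RGB copy of the current frame.
            texts.append(ocr(frame.convert("RGB"), lang=lang))
    return texts
```

Calling `ocr_all_pages('CAMARA.tif')` should then give one string per page, which you can print or join as needed.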

Answer 3

I just stumbled upon the same problem... What you can do is call tesseract directly:

"-I${MY_INCLUDE_DIR}"

This will process all pages:

$ python test.py 
Tesseract Open Source OCR Engine v4.0.0-beta.1 with Leptonica
Page 1
Page 2
Page 3
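
One way to call tesseract directly from Python is through `subprocess`. This is only a sketch, not necessarily the command used above: it assumes the `tesseract` executable is on your PATH, and the helper names `build_tesseract_command` and `ocr_with_cli` are hypothetical. It relies on the Tesseract CLI form `tesseract <image> <outputbase> -l <lang>`, where `stdout` as the output base prints the text instead of writing a file.

```python
import subprocess


def build_tesseract_command(image_path, lang="spa", tesseract_cmd="tesseract"):
    """Build the argument list for a direct Tesseract CLI call."""
    # "stdout" as the output base makes Tesseract print the recognized text.
    return [tesseract_cmd, image_path, "stdout", "-l", lang]


def ocr_with_cli(image_path, lang="spa", tesseract_cmd="tesseract"):
    """Run Tesseract on the file and return whatever text it prints."""
    result = subprocess.run(
        build_tesseract_command(image_path, lang, tesseract_cmd),
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,            # decode the output as str
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

On Windows you would pass the full path from the question as `tesseract_cmd`. As far as I know, the "Page N" progress lines are written to stderr, so they should not end up in the returned text.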

pytesseract and image.tif files
