Question

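I have PDF files containing scanned images that I want to extract text from. I came across this article, which describes how to do it. I have used Python extensively on Linux, but here I am on Windows 10, and I believe I have installed all the dependencies correctly. My code is as follows: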

from wand.image import Image
from PIL import Image as PI
import pyocr
import pyocr.builders
import io
import glob

tool = pyocr.get_available_tools()[0]
type(tool)
lang = tool.get_available_builders()[0]

req_image = []
final_text = []

files = glob.glob(r"S:\test_data\*")
print files[0]

image_pdf = Image(filename=files[0], resolution=300)
image_jpeg = image_pdf.convert('jpeg')

for img in image_jpeg.sequence:
    img_page = Image(image=img)
    req_image.append(img_page.make_blob('jpeg'))

for img in req_image:
    txt = tool.image_to_string(
        PI.open(io.BytesIO(img)),
        lang=lang,
        builder=pyocr.builders.TextBuilder()
    )
    final_text.append(txt)

print final_text

When I run it, I get the following error:

S:\test_data\test.pdf
Traceback (most recent call last):
  File "C:/Users/pbhor/PycharmProjects/test/test.py", line 29, in <module>
    builder=pyocr.builders.TextBuilder()
  File "C:\Users\pbhor\PycharmProjects\test\venv\lib\site-packages\pyocr\tesseract.py", line 365, in image_to_string
    configs=builder.tesseract_configs)
  File "C:\Users\pbhor\PycharmProjects\test\venv\lib\site-packages\pyocr\tesseract.py", line 281, in run_tesseract
    stderr=subprocess.STDOUT)
  File "C:\Python27\Lib\subprocess.py", line 394, in __init__
    errread, errwrite)
  File "C:\Python27\Lib\subprocess.py", line 599, in _execute_child
    args = list2cmdline(args)
  File "C:\Python27\Lib\subprocess.py", line 266, in list2cmdline
    needquote = (" " in arg) or ("\t" in arg) or not arg
TypeError: argument of type 'type' is not iterable

Process finished with exit code 1

What am I doing wrong here?

Answer 1

    txt = tool.image_to_string(
        Image.open(img),
        lang=lang,
        builder=pyocr.builders.TextBuilder()
    )
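
The traceback in the question ends inside subprocess.list2cmdline, which suggests that one of the arguments handed to the tesseract command was a class rather than a string. In the question's code, lang is taken from tool.get_available_builders()[0], which yields a builder class, so that is the likely culprit. The sketch below reproduces the failure with the standard library only. The TextBuilder class and the command lists are hypothetical stand-ins, not what pyocr actually builds, and the code is written to run on both Python 2.7 and Python 3.

```python
import subprocess


class TextBuilder(object):
    """Hypothetical stand-in for a pyocr builder class (not the real one)."""


def non_string_args(args):
    """Return (index, value) pairs for arguments that are not plain strings."""
    return [(i, a) for i, a in enumerate(args) if not isinstance(a, str)]


# Illustrative command lists; the exact arguments pyocr passes may differ.
good_command = ["tesseract", "page.png", "out", "-l", "eng"]
bad_command = ["tesseract", "page.png", "out", "-l", TextBuilder]

# A list of strings is joined into a command line without trouble.
good_line = subprocess.list2cmdline(good_command)

# A class in the list makes list2cmdline raise TypeError.
try:
    subprocess.list2cmdline(bad_command)
    raised = False
except TypeError:
    raised = True

# Locate the offending argument before it ever reaches subprocess.
offenders = non_string_args(bad_command)
```

A check like non_string_args can be dropped in just before the OCR call to confirm which value is not a string.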

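Since the snippet above still passes lang=lang, lang has to be a language code string such as "eng". pyocr tools expose get_available_languages() for this, so a minimal sketch of choosing the language could look like the following. The helper pick_language and the FakeTool class are assumptions made for illustration; FakeTool only exists so the sketch runs without Tesseract installed.

```python
def pick_language(tool, preferred="eng"):
    """Return a language code string from the tool, preferring `preferred`."""
    languages = tool.get_available_languages()
    if not languages:
        raise RuntimeError("the OCR tool reports no installed languages")
    return preferred if preferred in languages else languages[0]


class FakeTool(object):
    """Hypothetical stand-in for a pyocr tool object."""

    def get_available_languages(self):
        return ["deu", "eng", "osd"]


lang = pick_language(FakeTool())

# With a real pyocr tool the call would look the same:
#   tool = pyocr.get_available_tools()[0]
#   lang = pick_language(tool)
```

In the question's script this would replace the line lang = tool.get_available_builders()[0]; the rest of the loop can stay as it is.
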
TypeError: argument of type 'type' is not iterable while extracting text from a scanned PDF