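I have PDF files containing scanned images from which I want to extract text. I came across this article here, which describes how to do it. I have used Python extensively on Linux, but here I am on Windows 10, and I think I have installed all the dependencies correctly. My code is as follows: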
from wand.image import Image
from PIL import Image as PI
import pyocr
import pyocr.builders
import io
import glob

tool = pyocr.get_available_tools()[0]
type(tool)
lang = tool.get_available_builders()[0]

req_image = []
final_text = []

# Raw string so the backslashes in the Windows path are not read as escapes
files = glob.glob(r"S:\test_data\*")
print files[0]

# Render the PDF at 300 dpi and convert it to JPEG
image_pdf = Image(filename=files[0], resolution=300)
image_jpeg = image_pdf.convert('jpeg')

# Collect one JPEG blob per page
for img in image_jpeg.sequence:
    img_page = Image(image=img)
    req_image.append(img_page.make_blob('jpeg'))

# Run OCR on each page
for img in req_image:
    txt = tool.image_to_string(
        PI.open(io.BytesIO(img)),
        lang=lang,
        builder=pyocr.builders.TextBuilder()
    )
    final_text.append(txt)

print final_text
When I run it, I get the following error:
S:\test_data\test.pdf
Traceback (most recent call last):
  File "C:/Users/pbhor/PycharmProjects/test/test.py", line 29, in <module>
    builder=pyocr.builders.TextBuilder()
  File "C:\Users\pbhor\PycharmProjects\test\venv\lib\site-packages\pyocr\tesseract.py", line 365, in image_to_string
    configs=builder.tesseract_configs)
  File "C:\Users\pbhor\PycharmProjects\test\venv\lib\site-packages\pyocr\tesseract.py", line 281, in run_tesseract
    stderr=subprocess.STDOUT)
  File "C:\Python27\Lib\subprocess.py", line 394, in __init__
    errread, errwrite)
  File "C:\Python27\Lib\subprocess.py", line 599, in _execute_child
    args = list2cmdline(args)
  File "C:\Python27\Lib\subprocess.py", line 266, in list2cmdline
    needquote = (" " in arg) or ("\t" in arg) or not arg
TypeError: argument of type 'type' is not iterable

Process finished with exit code 1
What am I doing wrong here?
Answer 0 (score: 0)
txt = tool.image_to_string(
    Image.open(img),
    lang=lang,
    builder=pyocr.builders.TextBuilder()
)
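
Reading the traceback, the failure happens while `subprocess` builds the Tesseract command line, and the message says that one of the arguments is a class rather than a string. A plausible cause, not confirmed by the answer above, is that `lang` in the question is taken from `tool.get_available_builders()[0]`, which yields a builder class, whereas the `lang` parameter expects a language code such as `eng`. Below is a minimal sketch of that idea. `pick_language`, `needs_quoting` and `FakeBuilder` are hypothetical names used only for illustration, and the `tool` argument is assumed to offer `get_available_languages()`, as pyocr tools do.

```python
def pick_language(tool, preferred="eng"):
    """Return a language code (a string) offered by `tool`.

    Assumption: `tool` exposes get_available_languages(), as pyocr's
    Tesseract wrapper does. This helper is hypothetical, not part of pyocr.
    """
    languages = tool.get_available_languages()
    if not languages:
        raise RuntimeError("the OCR tool reports no installed languages")
    # Prefer the requested code; otherwise fall back to the first one listed.
    lang = preferred if preferred in languages else languages[0]
    if not isinstance(lang, str):
        raise TypeError("lang must be a string, got %r" % (lang,))
    return lang


class FakeBuilder(object):
    """Stand-in for a builder class, used only to reproduce the failure."""


def needs_quoting(arg):
    # Same membership test as the subprocess line shown in the traceback.
    # It works for strings and raises TypeError when `arg` is a class.
    return (" " in arg) or ("\t" in arg) or not arg
```

Calling `needs_quoting(FakeBuilder)` raises a `TypeError` in the same way as the traceback, while `needs_quoting("eng")` does not. Under the assumption above, the question's code would set `lang = pick_language(tool)` before calling `tool.image_to_string(...)` and leave the rest unchanged.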