Question

I want to use PyOCR to extract Thai text from an image, but I can't print the resulting string.

Here is the code.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from PIL import Image
import sys
import pyocr
import pyocr.builders

tools = pyocr.get_available_tools()
if len(tools) == 0:
    print("No OCR tool found")
    sys.exit(1)
tool = tools[0]  # use the first available OCR tool
print("Will use tool '%s'" % (tool.get_name()))

langs = tool.get_available_languages()
print("Available languages: %s" % ", ".join(langs))
lang = langs[3]
print("Will use lang '%s'" % (lang))

txt = tool.image_to_string(
    Image.open('test2.png'),
    lang=lang,
    builder=pyocr.builders.TextBuilder()
)

print(txt)

This gives me the following error.

Traceback (most recent call last):
  File "tess.py", line 29, in <module>
    print(txt)
  File "C:\Python34\lib\encodings\cp437.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-2: character maps to <undefined>
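The traceback points at the console encoding (cp437) rather than at the OCR step. A minimal sketch of that reading, assuming a short hard-coded Thai string as a stand-in for the OCR output (the helper name `can_encode` is hypothetical, not part of PyOCR or pytesseract):

```python
# Three Thai characters written as escapes, standing in for the OCR result.
text = "\u0e19\u0e32\u0e22"


def can_encode(s, encoding):
    """Return True if every character of s is representable in encoding."""
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# cp437 is the code page named in the traceback; it has no Thai characters.
cp437_ok = can_encode(text, "cp437")
utf8_ok = can_encode(text, "utf-8")
print(cp437_ok, utf8_ok)
```

If this behaves the same way on the machine in question, the string returned by the OCR call is probably fine and only `print` to a cp437 console fails.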

Here is the pytesseract version of the code; it gives the same error.

from PIL import Image
import pytesseract

text = pytesseract.image_to_string(Image.open('test2.png'), lang='tha')
print(text)

After I tried encoding it as UTF-8, it gave me this output.

b'\xe0\xb8\x99\xe0\xb8\xb2\xe0\xb8\xa2\xe0\xb8\xa0\xe0\xb8\x84\xe0\xb8\x9e\xe0\xb8\x87\xe0\xb8\xa9\xe0\xb9\x8c\xe0\xb8\xaa\xe0\xb8\xad\xe0\xb8\x99\xe0\xb9\x80\xe0\xb8\xad\xe0\xb8\x81'
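The bytes above look like valid UTF-8, which would mean the OCR result itself is intact. A small sketch, assuming the byte string is exactly the output shown, that decodes it and prints an escaped form that any console can display (the helper `console_safe` is a hypothetical name introduced here for illustration):

```python
# The UTF-8 bytes from the output above, with the stray spaces removed.
raw = (b'\xe0\xb8\x99\xe0\xb8\xb2\xe0\xb8\xa2\xe0\xb8\xa0\xe0\xb8\x84'
       b'\xe0\xb8\x9e\xe0\xb8\x87\xe0\xb8\xa9\xe0\xb9\x8c\xe0\xb8\xaa'
       b'\xe0\xb8\xad\xe0\xb8\x99\xe0\xb9\x80\xe0\xb8\xad\xe0\xb8\x81')

# Decoding succeeds only if the bytes are well-formed UTF-8.
text = raw.decode("utf-8")


def console_safe(s, encoding="cp437"):
    """Replace characters the target encoding lacks with \\uXXXX escapes."""
    return s.encode(encoding, errors="backslashreplace").decode(encoding)


print(len(text))           # number of code points recovered
print(console_safe(text))  # ASCII-only output, printable on a cp437 console
```

This only avoids the exception; it does not make the console show Thai glyphs. To inspect the actual text, one option is to write it to a file opened with `encoding="utf-8"` and view that file in an editor that supports Thai.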

Here is the image I used.

Can't print the string extracted from an image using either PyOCR or pytesseract

0 Answers: