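I've been experimenting with Tesseract OCR lately. I can find the characters in an image, but I'm having trouble finding only the bold characters (that is, knowing whether a character in the document image is bold). I saw the function WordFontAttributes() from the Tesseract API in another question (Can I use OCR to detect font style (bold, italic)?), but I couldn't get it working in Python.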
Answer 0 (score: 0)
Install Tesseract 3.05 first (version 4 does not support WordFontAttributes):
from tesserocr import PyTessBaseAPI, RIL, iterate_level


def get_words_info(image_path, tessdata_path):
    """
    Take the path to an image and the path to tessdata, and return a list
    of dicts with font attributes, text, and position for each word.
    """
    with PyTessBaseAPI(path=tessdata_path) as api:
        api.SetImageFile(image_path)
        api.Recognize()
        ri = api.GetIterator()  # named ri to avoid shadowing the built-in iter()
        level = RIL.WORD
        result = []
        for r in iterate_level(ri, level):
            element = r.GetUTF8Text(level)
            word_attributes = r.WordFontAttributes()
            bounding_box = r.BoundingBox(level)
            # Skip words with no text or no font information
            # (WordFontAttributes() can return None).
            if element and word_attributes:
                word_attributes['word'] = element
                word_attributes['position'] = bounding_box
                result.append(word_attributes)
    return result
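To get only the bold words, which is what the question asks for, you could filter the list returned by get_words_info on its 'bold' flag. The following is a minimal sketch, assuming each dict carries the boolean 'bold' key that tesserocr's WordFontAttributes() normally provides, plus the 'word' key added above. The helper name bold_words and the sample data are hypothetical, made up for illustration rather than taken from real OCR output.

```python
def bold_words(words_info):
    """Return the text of every word whose font attributes mark it as bold."""
    # .get() is used so that a dict without a 'bold' key counts as not bold.
    return [info['word'] for info in words_info if info.get('bold')]


# Hypothetical sample shaped like the output of get_words_info().
sample = [
    {'word': 'Hello', 'bold': True, 'italic': False, 'position': (10, 10, 60, 30)},
    {'word': 'world', 'bold': False, 'italic': False, 'position': (70, 10, 120, 30)},
]

print(bold_words(sample))
```

With real input you would call it as bold_words(get_words_info(image_path, tessdata_path)). Note that the bold flag is Tesseract's own guess from the legacy engine, so it may be unreliable on noisy scans.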