Unable to extract a word from an image

Time: 2018-06-20 08:53:42

Tags: python python-3.x web-scraping python-imaging-library python-tesseract

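I have written a script in Python using pytesseract to extract a word from an image. The image contains only one word, TOOLS, and that is what I am after. Currently, my script below gives me the wrong output, WIS. What can I do to get the text?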

Link to that image

This is my script:

import requests, io, pytesseract
from PIL import Image

response = requests.get('http://facweb.cs.depaul.edu/sgrais/images/Type/Tools.jpg')
img = Image.open(io.BytesIO(response.content))
img = img.resize([100,100], Image.ANTIALIAS)
img = img.convert('L')
img = img.point(lambda x: 0 if x < 170 else 255)
imagetext = pytesseract.image_to_string(img)
print(imagetext)
# img.show()

This is the state of the modified image when I run the script above:

[image]

The output I get:

WIS

Expected output:

TOOLS

2 answers:

Answer 0: (score: 11)

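The key is to match the image transformation to what tesseract can handle. Your main problem is that the font is not a common one. All you need is: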

import requests, io, pytesseract
from PIL import Image, ImageEnhance, ImageFilter

response = requests.get('http://facweb.cs.depaul.edu/sgrais/images/Type/Tools.jpg')
img = Image.open(io.BytesIO(response.content))

# remove texture
enhancer = ImageEnhance.Color(img)
img = enhancer.enhance(0)   # decolorize
img = img.point(lambda x: 0 if x < 250 else 255) # set threshold
img = img.resize([300, 100], Image.LANCZOS) # resize to remove noise
img = img.point(lambda x: 0 if x < 250 else 255) # get rid of remains of noise
# adjust font weight
img = img.filter(ImageFilter.MaxFilter(11)) # lighten the font ;)
imagetext = pytesseract.image_to_string(img)
print(imagetext)

And voilà,

TOOLS

is recognized.
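To see why the MaxFilter step in the answer above "lightens" the font, here is a minimal pure-Python sketch of a max filter on a tiny grayscale grid. The helper `max_filter` and the sample grid are illustrative assumptions, not Pillow's implementation, and Pillow's handling of image edges may differ. Each pixel is replaced by the brightest value in its neighbourhood, so white background grows into the dark strokes and the letters get thinner.

```python
def max_filter(pixels, size):
    """Return a new 2D grid where each pixel is the maximum of the
    size x size window centred on it (the window is clipped at the edges)."""
    radius = size // 2
    height, width = len(pixels), len(pixels[0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                pixels[j][i]
                for j in range(max(0, y - radius), min(height, y + radius + 1))
                for i in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            row.append(max(window))
        result.append(row)
    return result


# 0 = black (ink), 255 = white (background).
# A vertical stroke, 3 pixels wide, in the middle of a 5 x 7 grid.
W, B = 255, 0
stroke = [[W, W, B, B, B, W, W] for _ in range(5)]

thinned = max_filter(stroke, 3)

# Count dark pixels in the middle row before and after filtering.
dark_before = sum(v == B for v in stroke[2])
dark_after = sum(v == B for v in thinned[2])
print(dark_before, dark_after)  # 3 1
```

With a window of 11, as in the answer, strokes narrower than the window would disappear entirely, so the window size probably has to be tuned to the stroke width of the font in the image.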

Answer 1: (score: 0)

The key problem with your implementation is the resize and threshold step. You could try different sizes and different thresholds:

img = img.resize([100,100], Image.ANTIALIAS)
img = img.point(lambda x: 0 if x < 170 else 255)

and see what works for you.

I would suggest resizing the image while keeping its original proportions. The script below loops over thumbnail sizes, thresholds, resampling filters, and image filters, prints what tesseract reads for each combination, and saves each intermediate image:

import requests, io, pytesseract
from PIL import Image
from PIL import ImageFilter

response = requests.get('http://facweb.cs.depaul.edu/sgrais/images/Type/Tools.jpg')
img = Image.open(io.BytesIO(response.content))

filters = [
    # ('nearest', Image.NEAREST),
    ('box', Image.BOX),
    # ('bilinear', Image.BILINEAR),
    # ('hamming', Image.HAMMING),
    # ('bicubic', Image.BICUBIC),
    ('lanczos', Image.LANCZOS),
]

subtle_filters = [
    # 'BLUR',
    # 'CONTOUR',
    'DETAIL',
    'EDGE_ENHANCE',
    'EDGE_ENHANCE_MORE',
    # 'EMBOSS',
    'FIND_EDGES',
    'SHARPEN',
    'SMOOTH',
    'SMOOTH_MORE',
]

for name, filt in filters:
    for subtle_filter_name in subtle_filters:
        for s in range(220, 250, 10):
            for threshold in range(250, 253, 1):
                img_temp = img.copy()
                img_temp.thumbnail([s,s], filt)  # shrinks in place, keeps proportions
                img_temp = img_temp.convert('L')
                img_temp = img_temp.point(lambda x: 0 if x < threshold else 255)
                img_temp = img_temp.filter(getattr(ImageFilter, subtle_filter_name))
                imagetext = pytesseract.image_to_string(img_temp)
                print(s, threshold, name, subtle_filter_name, imagetext)
                with open('thumb%s_%s_%s_%s.jpg' % (s, threshold, name, subtle_filter_name), 'wb') as g:
                    img_temp.save(g)

You could also try alternatives to img_temp.convert('L').

Best result so far: TWls
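The advice above about keeping the original proportions when resizing can be sketched as a small helper that computes the target size. The name `scaled_size` and the 600 x 200 example dimensions are hypothetical; the real size of the image in the question is not stated.

```python
def scaled_size(width, height, max_side):
    """Scale (width, height) so that the longer side equals max_side
    while keeping the aspect ratio. Each side is at least 1 pixel."""
    scale = max_side / max(width, height)
    return max(1, round(width * scale)), max(1, round(height * scale))


# Hypothetical 600 x 200 source image scaled so its longer side is 300.
print(scaled_size(600, 200, 300))  # (300, 100)
```

The resulting tuple could then be passed to `img.resize(...)` in place of a fixed `[100,100]`. `Image.thumbnail` does a similar calculation in place, but it only shrinks an image and never enlarges it.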

You can also try manipulating the image manually to see whether you can find edits that give better output (for example, http://gimpchat.com/viewtopic.php?f=8&t=1193).

If you know the font in advance, you may also get better results.