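I am using Tesseract to detect text in images and PDFs. I installed it on OS X El Capitan, where it works very well. I also installed it on an Ubuntu 16.04 server, but there it performs poorly: the results for the same image differ. Both systems run the same versions: Tesseract 4.00 alpha and Leptonica 1.74.1.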
OSX
tesseract 4.00.00alpha
leptonica-1.74.1
libjpeg 8d : libpng 1.6.29 : libtiff 4.0.7 : zlib 1.2.5
Ubuntu
tesseract 4.00.00alpha
leptonica-1.74.1
libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8
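The two listings above can be compared mechanically rather than by eye. The following is only a minimal sketch, assuming the text is pasted from the version report shown above; `parse_versions` and `diff_versions` are hypothetical helper names, not part of Tesseract or pyocr.

```python
import re


def parse_versions(report: str) -> dict:
    """Map each library name to its version string from a version report."""
    versions = {}
    for line in report.strip().splitlines():
        # One line may hold several "name version" entries separated by " : ".
        for entry in line.split(" : "):
            entry = entry.strip()
            # Accept "name version" or "name-version"; any trailing note,
            # such as "(libjpeg-turbo 1.4.2)", stays part of the version.
            match = re.match(r"^([A-Za-z]+)[ -](\S.*)$", entry)
            if match:
                versions[match.group(1)] = match.group(2)
    return versions


def diff_versions(a: dict, b: dict) -> dict:
    """Return {name: (version_in_a, version_in_b)} for entries that differ."""
    names = sorted(set(a) | set(b))
    return {n: (a.get(n), b.get(n)) for n in names if a.get(n) != b.get(n)}


OSX = """
tesseract 4.00.00alpha
leptonica-1.74.1
libjpeg 8d : libpng 1.6.29 : libtiff 4.0.7 : zlib 1.2.5
"""

UBUNTU = """
tesseract 4.00.00alpha
leptonica-1.74.1
libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8
"""

differences = diff_versions(parse_versions(OSX), parse_versions(UBUNTU))
for name, (osx_version, ubuntu_version) in differences.items():
    print(f"{name}: OSX={osx_version} | Ubuntu={ubuntu_version}")
```

For these two listings the sketch reports libjpeg, libpng, libtiff and zlib as differing, while tesseract and leptonica match.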
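I use the Wand wrapper (on top of ImageMagick / libmagick-dev) in Python to convert the image or PDF file to JPEG. I suspect the difference has something to do with JPEG encoding. Why does it behave this way?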
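# Imports inferred from the aliases used below; args, language and tool are defined elsewhere.
from io import BytesIO

import numpy as np
import pyocr.builders
from PIL import Image as PI
from wand.image import Image as WI

# Rasterize the input (image or PDF) at 300 DPI
all_pages = WI(filename=args['image'], resolution=300)
all_pages.compression_quality = 100
# I tried setting the compression type, but it did not change anything
# all_pages.compression = 'jpeg2000'
image_jpeg = all_pages.convert('jpg')
# Keep only the first page and serialize it as a JPEG blob
single_image = WI(image=image_jpeg.sequence[0])
blob = single_image.make_blob("jpg")
single_image.close()
all_pages.close()
# Decode the blob with PIL and convert it to 8-bit greyscale
img = PI.open(BytesIO(blob)).convert("L")
greyImg = np.array(img)
# do some image processing (omitted; it produces the array `output`)
pIm_bw = PI.fromarray(output)
builder = pyocr.builders.LineBoxBuilder()
print("LineBoxBuilder set")
txt = tool.image_to_string(
    pIm_bw,
    lang=language,
    builder=builder
)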
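One way to test the JPEG hypothesis is to check whether both machines hand Tesseract the same pixels at all. The following is a rough sketch, not a definitive diagnostic: `pixel_digest` is a hypothetical helper that fingerprints the decoded pixel data, so the digest printed on OS X can be compared with the one printed on Ubuntu. It uses only the standard library and takes the raw bytes and shape as arguments.

```python
import hashlib


def pixel_digest(raw: bytes, shape: tuple) -> str:
    """Return a SHA-256 hex digest covering the image shape and its raw pixel bytes."""
    h = hashlib.sha256()
    # Include the shape so that differently sized images with equal bytes do not collide.
    h.update(repr(tuple(shape)).encode("ascii"))
    h.update(raw)
    return h.hexdigest()


# Tiny stand-in for a 2x2 greyscale image (made-up values for illustration).
sample = bytes([0, 128, 128, 255])
same = pixel_digest(sample, (2, 2))
# Changing a single pixel value by one changes the digest.
changed = pixel_digest(bytes([0, 128, 129, 255]), (2, 2))

print(same == pixel_digest(sample, (2, 2)))
print(same == changed)
```

With the code above, the call would be `pixel_digest(greyImg.tobytes(), greyImg.shape)`. If the digests differ between the two machines, the discrepancy is introduced before Tesseract runs, for example in the rasterization or in the JPEG encode/decode step. If they match, the image pipeline can be ruled out and the difference lies in the OCR step itself.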