Question

I am trying to use Pytesseract for some very basic character recognition. When I run the following code on Linux, the output makes sense:

import matplotlib.pyplot as plt
import pandas as pd

import sys
import pytesseract
# On Windows, pytesseract must be told where the Tesseract executable is installed.
if sys.platform == 'win32':
    tesseract_path = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
    pytesseract.pytesseract.tesseract_cmd = tesseract_path

img = pd.read_csv('https://www.dropbox.com/s/fcs5bcmy73j75o0/two.csv?dl=1').values
fig,ax=plt.subplots()
ax.imshow(img.astype(float),cmap='gray')

print('identified as {}'.format(pytesseract.image_to_string(img.astype(float))))

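But the same call to pytesseract.image_to_string on Windows returns an empty string.

The code runs in a Python 3 environment on both machines.

Is there an obvious step in installing Tesseract on the Windows machine that would explain this behavior?

On Windows, Tesseract was installed with the installer from: https://github.com/UB-Mannheim/tesseract/wiki

On Linux, I simply used: yum install tesseract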

Answer 1

I ran into the same problem. It turned out that it works once I point tesseract_cmd at the Tesseract-OCR v5.0 folder (which I installed from here).

pytesseract.pytesseract.tesseract_cmd = 'C:\\Users\\minh.nguyen\\AppData\\Local\\Tesseract-OCR\\tesseract.exe'

Note that I use Tesseract v5 rather than v4.1 because it gives better results.
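
Since the fix above depends on which tesseract.exe pytesseract actually ends up calling, it may help to check the path and version on both machines before comparing OCR output. The following is only a minimal sketch: find_tesseract is a hypothetical helper (not part of pytesseract), and the candidate paths are assumptions based on the two install locations mentioned in this thread, so adjust them to your own setup.

```python
import os
import shutil
import sys


def find_tesseract(candidates):
    """Return the first candidate that exists as a file.

    Falls back to whatever "tesseract" resolves to on PATH (or None).
    Hypothetical helper for illustration, not a pytesseract function.
    """
    for path in candidates:
        if os.path.isfile(path):
            return path
    return shutil.which("tesseract")


if __name__ == "__main__":
    # Assumed install locations on Windows; edit these for your machine.
    candidates = []
    if sys.platform == "win32":
        candidates = [
            r"C:\Program Files\Tesseract-OCR\tesseract.exe",
            os.path.expandvars(r"%LOCALAPPDATA%\Tesseract-OCR\tesseract.exe"),
        ]

    cmd = find_tesseract(candidates)
    print("tesseract executable:", cmd)

    if cmd is not None:
        import pytesseract

        pytesseract.pytesseract.tesseract_cmd = cmd
        # Prints the version of the binary that pytesseract will invoke.
        print("tesseract version:", pytesseract.get_tesseract_version())
```

If the two machines report different versions, that difference alone could account for different recognition results, although this is not confirmed for the case in the question.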

Pytesseract behaves differently on Windows and Linux
