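I am trying to build an OCR extension for python-tesseract that specifically handles data tables with internal structure (for example, rows and columns with subtotals and totals, so that the user can improve accuracy by enforcing that structure).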
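I am trying to access the confidence values that tesseract assigns to multiple results (for example, all results from an unconstrained run as well as all results from a run whose characters are restricted to [0-9\.]).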
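I have seen some information about reading the x_wconf attribute from the output of the GetHOCRText API method, but I can't figure out how to access it from the Python API. How do you call/access this value? Thanks!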
I am using python-tesseract 0.9.1 with Python 2.7 on OSX 10.10.3.
Answer 0 (score: 0)
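I was actually wrong: I was thinking of pytesseract, not python-tesseract.
If you look at the API source (baseapi_mini.h), you'll find several functions that sound promising for what you are trying to do. The part you are interested in starts at around line 500.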
char* GetUTF8Text();
/**
* Make a HTML-formatted string with hOCR markup from the internal
* data structures.
* page_number is 0-based but will appear in the output as 1-based.
*/
char* GetHOCRText(int page_number);
/**
* The recognized text is returned as a char* which is coded in the same
* format as a box file used in training. Returned string must be freed with
* the delete [] operator.
* Constructs coordinates in the original image - not just the rectangle.
* page_number is a 0-based page index that will appear in the box file.
*/
char* GetBoxText(int page_number);
/**
* The recognized text is returned as a char* which is coded
* as UNLV format Latin-1 with specific reject and suspect codes
* and must be freed with the delete [] operator.
*/
char* GetUNLVText();
/** Returns the (average) confidence value between 0 and 100. */
int MeanTextConf();
/**
* Returns all word confidences (between 0 and 100) in an array, terminated
* by -1. The calling function must delete [] after use.
* The number of confidences should correspond to the number of space-
* delimited words in GetUTF8Text.
*/
int* AllWordConfidences();
/**
* Applies the given word to the adaptive classifier if possible.
* The word must be SPACE-DELIMITED UTF-8 - l i k e t h i s , so it can
* tell the boundaries of the graphemes.
* Assumes that SetImage/SetRectangle have been used to set the image
* to the given word. The mode arg should be PSM_SINGLE_WORD or
* PSM_CIRCLE_WORD, as that will be used to control layout analysis.
* The currently set PageSegMode is preserved.
* Returns false if adaption was not possible for some reason.
*/
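To use these, you will have to write your own wrapper.
python-tesseract is nice because it gets you up and running quickly, but it is not what I would call sophisticated. You can read the source to see how it works, but here is the synopsis: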
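As one small piece of such a wrapper, the -1-terminated array described for AllWordConfidences above could be copied into a Python list roughly like this. This is a minimal sketch, not tested against a real build: `read_confidences` and its `limit` guard are hypothetical names introduced here, and it assumes you already hold a `ctypes.POINTER(ctypes.c_int)` to the array (for example from `TessBaseAPIAllWordConfidences` in Tesseract's C API, whose result would then need to be released with `TessDeleteIntArray`).

```python
import ctypes


def read_confidences(ptr, limit=100000):
    """Copy a -1-terminated C int array into a Python list.

    `ptr` is a ctypes.POINTER(ctypes.c_int). `limit` is a safety guard
    (an assumption of this sketch) so a missing terminator cannot make
    the loop run forever.
    """
    values = []
    i = 0
    while i < limit and ptr[i] != -1:
        values.append(ptr[i])
        i += 1
    return values


if __name__ == "__main__":
    # Stand-in for an array returned by the library: three words, then -1.
    fake = (ctypes.c_int * 4)(96, 71, 88, -1)
    fake_ptr = ctypes.cast(fake, ctypes.POINTER(ctypes.c_int))
    print(read_confidences(fake_ptr))
```

Per the header comment, the number of values should match the number of space-delimited words in the recognized text, so the list can be zipped against the split output to pair each word with its confidence.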
1. Write the input image to a temporary file
2. Call the tesseract command (from the command line) on that file
3. Return the result
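The three steps above can be sketched as follows. This is an illustration of the general pattern, not the wrapper's actual source: `build_command` and `ocr_via_cli` are hypothetical names, and the sketch assumes a `tesseract` executable on the PATH that writes its output to `<output_base>.txt`.

```python
import io
import os
import shutil
import subprocess
import tempfile


def build_command(image_path, output_base, config=()):
    """Build the command line; extra config names are passed through as-is."""
    return ["tesseract", image_path, output_base] + list(config)


def ocr_via_cli(image_bytes, suffix=".png", config=()):
    """Write the image to a temp file, run tesseract on it, return the text."""
    workdir = tempfile.mkdtemp()
    image_path = os.path.join(workdir, "input" + suffix)
    output_base = os.path.join(workdir, "output")
    try:
        # Step 1: write the input image to a temporary file.
        with open(image_path, "wb") as fh:
            fh.write(image_bytes)
        # Step 2: call the tesseract command on that file.
        subprocess.check_call(build_command(image_path, output_base, config))
        # Step 3: read back and return the result.
        with io.open(output_base + ".txt", encoding="utf-8") as fh:
            return fh.read()
    finally:
        shutil.rmtree(workdir, ignore_errors=True)
```

Every call pays for a disk write, a process launch, and a fresh load of the language data, and only the plain text comes back, so per-word confidences are not available through this route.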
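So if you want to do anything special, this approach simply will not work.
I had an application that needed high performance, and the time spent waiting for the file to be written to disk, then for tesseract to start up, load the image, and process it, was simply too much.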
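If I remember correctly (I no longer have access to the source), I used ctypes to load the tesseract library, set the image data, and then called the GetHOCRText method. When I needed to process another image, I did not have to wait for tesseract to load again; I just set the new image data and called GetHOCRText again.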
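A rough reconstruction of that idea, under several assumptions, might look like the sketch below. It is not the code described above. It assumes a Tesseract build that exports the C API from capi.h (`TessBaseAPICreate`, `TessBaseAPIInit3`, `TessBaseAPISetImage`, `TessBaseAPIGetHOCRText`, `TessDeleteText`, `TessBaseAPIEnd`, `TessBaseAPIDelete`), that `ctypes.util.find_library` can locate it, that image rows are tightly packed, and that the hOCR output of your version includes `x_wconf`. The helper names `load_tesseract`, `hocr_for_images`, and `parse_word_confidences` are hypothetical.

```python
import ctypes
import ctypes.util
import re

# Word confidence as it appears in hOCR title attributes, e.g. "x_wconf 87".
_X_WCONF = re.compile(r"x_wconf\s+(\d+)")


def parse_word_confidences(hocr):
    """Return every x_wconf value found in an hOCR string, in document order."""
    return [int(value) for value in _X_WCONF.findall(hocr)]


def load_tesseract():
    """Load libtesseract and declare the C API signatures used below."""
    path = ctypes.util.find_library("tesseract")
    if path is None:
        raise OSError("libtesseract not found")
    lib = ctypes.CDLL(path)
    vp, ci, cs = ctypes.c_void_p, ctypes.c_int, ctypes.c_char_p
    lib.TessBaseAPICreate.restype = vp
    lib.TessBaseAPIInit3.argtypes = [vp, cs, cs]
    lib.TessBaseAPIInit3.restype = ci
    lib.TessBaseAPISetImage.argtypes = [vp, cs, ci, ci, ci, ci]
    lib.TessBaseAPISetImage.restype = None
    lib.TessBaseAPIGetHOCRText.argtypes = [vp, ci]
    # Keep the raw pointer (not c_char_p) so it can be freed afterwards.
    lib.TessBaseAPIGetHOCRText.restype = vp
    lib.TessDeleteText.argtypes = [vp]
    lib.TessBaseAPIEnd.argtypes = [vp]
    lib.TessBaseAPIDelete.argtypes = [vp]
    return lib


def hocr_for_images(images, lang=b"eng"):
    """Yield (hocr, confidences) per image, reusing one initialised engine.

    `images` yields (raw_bytes, width, height, bytes_per_pixel) tuples.
    """
    lib = load_tesseract()
    handle = lib.TessBaseAPICreate()
    try:
        # Pay the start-up cost once; None means the default data path.
        if lib.TessBaseAPIInit3(handle, None, lang) != 0:
            raise RuntimeError("could not initialise tesseract")
        for data, width, height, bpp in images:
            # Assumes rows are tightly packed: bytes_per_line = width * bpp.
            lib.TessBaseAPISetImage(handle, data, width, height, bpp, width * bpp)
            ptr = lib.TessBaseAPIGetHOCRText(handle, 0)
            if not ptr:
                raise RuntimeError("GetHOCRText returned NULL")
            try:
                hocr = ctypes.string_at(ptr).decode("utf-8")
            finally:
                lib.TessDeleteText(ptr)
            yield hocr, parse_word_confidences(hocr)
    finally:
        lib.TessBaseAPIEnd(handle)
        lib.TessBaseAPIDelete(handle)
```

Because the engine is initialised once and only the image changes between calls, the same loop could be run twice, once unconstrained and once with a character restriction applied, and the two confidence lists compared word by word.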
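So this is not an exact solution to your problem, and it is definitely not a code snippet you can drop in. But hopefully it helps you move toward your goal.
Here is another question about wrapping external libraries: Wrapping a C library in Python: C, Cython or ctypes?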