Question

Is there a way to detect tables using Tesseract OCR?

I am using the following C# code (charlesw/tesseract):

using (var iter = page.GetIterator())
{
    iter.Begin();
    do
    {
        var blockType = iter.BlockType; // never equals PolyBlockType.Table

    } while (iter.Next(PageIteratorLevel.Word));
}

This iterates over the blocks and queries the BlockType property, but it never returns PolyBlockType.Table, even though my document contains tables.

I have also tried setting the "textord_tabfind_find_tables" variable to true, but with no luck.

Answer 1

I was only able to find tables with PSM 1, 2, 3, or 4; the page segmentation mode was the key for me.
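As a small sketch of that observation, the check below encodes the four page segmentation modes that worked here. The mode descriptions follow Tesseract's own `--psm` help text; the names `TABLE_CAPABLE_PSMS` and `supports_table_detection` are hypothetical, and the set reflects only what I saw in my tests, not a documented guarantee.

```python
# Page segmentation modes under which table blocks were reported in my tests.
# Descriptions are taken from the `tesseract --help-psm` listing.
TABLE_CAPABLE_PSMS = {
    1: "Automatic page segmentation with OSD",
    2: "Automatic page segmentation, but no OSD, or OCR",
    3: "Fully automatic page segmentation, but no OSD (default)",
    4: "Assume a single column of text of variable sizes",
}


def supports_table_detection(psm):
    """Return True if `psm` is one of the modes where tables were found."""
    return psm in TABLE_CAPABLE_PSMS


if __name__ == "__main__":
    for mode in (3, 6):
        print(mode, supports_table_detection(mode))
```

Modes such as 6 (single uniform block of text) skip the automatic layout analysis, which would explain why no table blocks show up there.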

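I was not able to find tables in every document, so I made a Google Doc containing one full-width table and two half-width tables (along with other text), exported it to PDF, and converted that to PNG at 300 and 600 dpi. Tesseract found my full-width table but not my half-width tables.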

Using the tesserocr Python bindings:

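from tesserocr import RIL, iterate_level

# Assumed definition: names in the order of Tesseract's PolyBlockType enum.
PT_NAME = ['UNKNOWN', 'FLOWING_TEXT', 'HEADING_TEXT', 'PULLOUT_TEXT',
           'EQUATION', 'INLINE_EQUATION', 'TABLE', 'VERTICAL_TEXT',
           'CAPTION_TEXT', 'FLOWING_IMAGE', 'HEADING_IMAGE', 'PULLOUT_IMAGE',
           'HORZ_LINE', 'VERT_LINE', 'NOISE', 'COUNT']

# `api` is a PyTessBaseAPI opened with PSM 1-4 and an image already set;
# `args` holds the parsed command-line options.
ret = api.SetVariable('textord_tabfind_find_tables', 'true')  # 400: 1
ret = api.SetVariable('textord_tablefind_recognize_tables', 'true') # 400: 0
ret = api.SetVariable('textord_tablefind_show_mark', 'true')  # 400: 0
ret = api.SetVariable('textord_tablefind_show_stats', 'true')   # 400: 0

if args.scrollview:
    api.SetVariable("textord_show_tables", "true")  # launches ScrollView

api.Recognize()
level = RIL.BLOCK           # BLOCK, PARA, SYMBOL, TEXTLINE, WORD
riter = api.GetIterator()
for r in iterate_level(riter, level):
    # Print every block, or only TABLE blocks when --tablesonly is given.
    if not args.tablesonly or PT_NAME[r.BlockType()] == 'TABLE':
        print('### blocktype={}={} confidence={} txt:\n{}'.format(
            r.BlockType(), PT_NAME[r.BlockType()],
            int(r.Confidence(level)), r.GetUTF8Text(level)))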
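For anyone who wants this without the command-line plumbing, here is a minimal self-contained sketch of the same idea. The function names `keep_tables` and `find_table_blocks` are hypothetical, and the default of PSM 3 is just my assumption based on the modes that worked for me. tesserocr is imported inside the function so the filtering helper can be used on its own.

```python
PT_TABLE = 6  # position of PT_TABLE in Tesseract's PolyBlockType enum


def keep_tables(blocks):
    """Keep (bbox, text) for each (block_type, bbox, text) tuple of type TABLE."""
    return [(bbox, text) for block_type, bbox, text in blocks
            if block_type == PT_TABLE]


def find_table_blocks(image_path, psm=3):
    """Run layout analysis on an image and return its TABLE blocks."""
    # Imported lazily so keep_tables() works without tesserocr installed.
    from tesserocr import PyTessBaseAPI, RIL, iterate_level

    blocks = []
    with PyTessBaseAPI(psm=psm) as api:
        api.SetVariable('textord_tabfind_find_tables', 'true')
        api.SetImageFile(image_path)
        api.Recognize()
        for r in iterate_level(api.GetIterator(), RIL.BLOCK):
            try:
                text = r.GetUTF8Text(RIL.BLOCK)
            except RuntimeError:
                text = ''  # tesserocr raises when a block has no text
            blocks.append((r.BlockType(), r.BoundingBox(RIL.BLOCK), text))
    return keep_tables(blocks)


if __name__ == "__main__":
    sample = [(1, (0, 0, 10, 10), 'body text'), (6, (0, 20, 10, 40), 'a | b')]
    print(keep_tables(sample))
```

Calling `find_table_blocks('page.png')` (a placeholder file name) should return the bounding box and text of each block Tesseract classified as a table, or an empty list if it found none, as happened with my half-width tables.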
