Recursive tesseract command

Time: 2018-02-22 12:20:48

Tags: bash batch-processing tesseract batch-rename

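I have a lot of tif files (1 page = 1 tif file). I want to merge them into a single large multi-page tif, then run tesseract on it (for OCR) and get a PDF as the result.

My files are named like this: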

    $ ls
    115_XRTL_000_001.tif  115_XRTL_000_004.tif  115_XRTL_000_007.tif  115_XRTL_000_010.tif
    115_XRTL_000_002.tif  115_XRTL_000_005.tif  115_XRTL_000_008.tif  115_XRTL_000_011.tif
    115_XRTL_000_003.tif  115_XRTL_000_006.tif  115_XRTL_000_009.tif

I use the following code:

    for f in *.tif; do base=${f%_*}; tiffcp $base* ../pdfs/$base.tif; tesseract ../pdfs/$base.tif -l fra ../pdfs/$base pdf; rm -v ../pdfs/$base.tif; done

It works well, except for the tesseract part: the command loops, and tesseract keeps running again and again on the same big tif file.

Log:

    $ for f in *.tif; do base=${f%_*}; tiffcp $base* ../../pdfs/$base.tif; tesseract ../../pdfs/$base.tif -l fra ../../pdfs/$base pdf; rm -v ../../pdfs/$base.tif; done
    Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
    Page 1
    Warning. Invalid resolution 30000 dpi. Using 70 instead.
    Estimating resolution as 650
    OSD: Weak margin (5.97) for 122 blob text block, but using orientation anyway: 0
    Page 2
    Warning. Invalid resolution 30000 dpi. Using 70 instead.
    Estimating resolution as 329
    Page 3
    Warning. Invalid resolution 30000 dpi. Using 70 instead.
    Estimating resolution as 346
    Error in boxClipToRectangle: box outside rectangle
    Error in pixScanForForeground: invalid box
    Page 4
    Warning. Invalid resolution 30000 dpi. Using 70 instead.
    Estimating resolution as 371
    Page 5
    Warning. Invalid resolution 30000 dpi. Using 70 instead.
    Estimating resolution as 375
    Page 6
    Warning. Invalid resolution 30000 dpi. Using 70 instead.
    Estimating resolution as 368
    Page 7
    Warning. Invalid resolution 30000 dpi. Using 70 instead.
    Estimating resolution as 370
    Detected 26 diacritics
    Page 8
    Warning. Invalid resolution 30000 dpi. Using 70 instead.
    Estimating resolution as 368
    Page 9
    Warning. Invalid resolution 30000 dpi. Using 70 instead.
    Estimating resolution as 348
    Page 10
    Warning. Invalid resolution 30000 dpi. Using 70 instead.
    Estimating resolution as 376
    Page 11
    Warning. Invalid resolution 30000 dpi. Using 70 instead.
    Estimating resolution as 354
    '../../pdfs/115_XRTL_000.tif' supprimé
    Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
    Page 1
    Warning. Invalid resolution 30000 dpi. Using 70 instead.
    Estimating resolution as 650
    OSD: Weak margin (5.97) for 122 blob text block, but using orientation anyway: 0
    Page 2
    Warning. Invalid resolution 30000 dpi. Using 70 instead.
    Estimating resolution as 329
    Page 3
    Warning. Invalid resolution 30000 dpi. Using 70 instead.
    Estimating resolution as 346
    Error in boxClipToRectangle: box outside rectangle
    Error in pixScanForForeground: invalid box
    Page 4
    Warning. Invalid resolution 30000 dpi. Using 70 instead.
    Estimating resolution as 371
    Page 5
    Warning. Invalid resolution 30000 dpi. Using 70 instead.
    Estimating resolution as 375
    ^C

How can I stop the loop?
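A possible explanation, offered as a sketch rather than a confirmed fix: `for f in *.tif` runs its body once per page file, and all 11 pages share the prefix `115_XRTL_000`, so the merge and the OCR would be repeated 11 times on the same result instead of looping forever. One way around this is to collect each distinct prefix first and process it once. The function names `unique_bases` and `ocr_base` and the `out_dir` parameter below are hypothetical, and the output directory is assumed to exist already.

```shell
#!/usr/bin/env bash

# Print each distinct prefix (file name minus the trailing _NNN.tif) exactly once.
unique_bases() {
    local dir=${1:-.} f
    for f in "$dir"/*.tif; do
        [ -e "$f" ] || continue      # glob matched nothing, skip the literal pattern
        f=${f##*/}                   # drop the directory part
        printf '%s\n' "${f%_*}"      # drop the final _NNN.tif
    done | sort -u
}

# Merge all pages of one prefix into a multi-page tif, OCR it to PDF,
# then delete the intermediate tif. "fra" is the language used in the question.
ocr_base() {
    local dir=$1 base=$2 out_dir=$3
    tiffcp "$dir/$base"_*.tif "$out_dir/$base.tif" &&
    tesseract "$out_dir/$base.tif" "$out_dir/$base" -l fra pdf &&
    rm -v "$out_dir/$base.tif"
}
```

With these defined, something like `unique_bases . | while IFS= read -r base; do ocr_base . "$base" ../pdfs; done` would run tesseract once per prefix. The glob `"$dir/$base"_*.tif` expands in sorted order, so the zero-padded page numbers keep the pages in sequence.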

Another problem is that the tif files are in different folders.

For example, if I take two folders (I have many more):

    $ ls -R
    .:
    115_XRTL_000  128_XRTL_001

    ./115_XRTL_000:
    115_XRTL_000_001.tif  115_XRTL_000_004.tif  115_XRTL_000_007.tif  115_XRTL_000_010.tif
    115_XRTL_000_002.tif  115_XRTL_000_005.tif  115_XRTL_000_008.tif  115_XRTL_000_011.tif
    115_XRTL_000_003.tif  115_XRTL_000_006.tif  115_XRTL_000_009.tif

    ./128_XRTL_001:
    128_XRTL_001_001.tif  128_XRTL_001_005.tif  128_XRTL_001_009.tif  128_XRTL_001_013.tif
    128_XRTL_001_002.tif  128_XRTL_001_006.tif  128_XRTL_001_010.tif  128_XRTL_001_014.tif
    128_XRTL_001_003.tif  128_XRTL_001_007.tif  128_XRTL_001_011.tif
    128_XRTL_001_004.tif  128_XRTL_001_008.tif  128_XRTL_001_012.tif

I would like to be able to run a single command that is applied recursively to every folder and creates one pdf (with OCR) per folder, placed in the ../pdfs/ folder.
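To make the goal concrete, here is a rough sketch of what such a recursive command might look like, assuming every folder holds the pages of exactly one document and the PDF should be named after the folder. The helper names `tif_dirs` and `ocr_dir` are hypothetical, and the output folder is assumed to lie outside the tree being scanned, as `../pdfs` does here.

```shell
#!/usr/bin/env bash

# List every directory under root that directly contains at least one .tif file.
tif_dirs() {
    local root=${1:-.}
    find "$root" -type f -name '*.tif' -exec dirname {} \; | sort -u
}

# Merge all pages found in one folder, OCR them into out_dir/<folder name>.pdf,
# then delete the intermediate multi-page tif.
ocr_dir() {
    local dir=$1 out_dir=$2 name
    name=$(basename "$dir")          # e.g. ./115_XRTL_000 -> 115_XRTL_000
    tiffcp "$dir"/*.tif "$out_dir/$name.tif" &&
    tesseract "$out_dir/$name.tif" "$out_dir/$name" -l fra pdf &&
    rm -v "$out_dir/$name.tif"
}
```

A possible invocation from the top-level folder would be `mkdir -p ../pdfs && tif_dirs . | while IFS= read -r d; do ocr_dir "$d" ../pdfs; done`. Because each folder is visited once, tesseract runs once per document. Tif files sitting directly in the top-level folder are not handled well by this sketch, since `basename .` yields `.` as the output name.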

0 answers:

No answers