Question

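This SO answer suggests that training tesseract with .tif files is better than with .png files, because a .tif file can hold multiple pages and so allows a larger training sample. However, this SO question discusses procedures for training on several images at once. What's more, the man pages (for mftraining, for example) say that it can accept multiple training files.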

Is there any reason not to train with multiple separate image files?

Answer 1

Using multiple images to train tesseract on a single font seems to work just fine. Below is a sketch of the workflow I use:

# Convert the .pdf pages to .png images
convert -density 600 Page1.pdf eng1.MyNewFont.exp1.png
convert -density 600 Page2.pdf eng1.MyNewFont.exp2.png

# Create .box files
tesseract eng1.MyNewFont.exp1.png eng1.MyNewFont.exp1 -l eng batch.nochop makebox
tesseract eng1.MyNewFont.exp2.png eng1.MyNewFont.exp2 -l eng batch.nochop makebox

## correct boxes with jTessBoxEditor or another box editor ##

# Create two new box.tr files: eng1.MyNewFont.exp1.box.tr and eng1.MyNewFont.exp2.box.tr

tesseract eng1.MyNewFont.exp1.png eng1.MyNewFont.exp1.box -l eng1 nobatch box.train.stderr
tesseract eng1.MyNewFont.exp2.png eng1.MyNewFont.exp2.box -l eng1 nobatch box.train.stderr

# Extract characters from the two .box files
unicharset_extractor eng1.MyNewFont.exp1.box eng1.MyNewFont.exp2.box 

echo "MyNewFont 0 0 0 0 0" >> font_properties

# train using the two new box.tr files.
mftraining -F font_properties -U unicharset -O eng1.unicharset eng1.MyNewFont.exp1.box.tr eng1.MyNewFont.exp2.box.tr 
cntraining eng1.MyNewFont.exp1.box.tr eng1.MyNewFont.exp2.box.tr

## rename files
mv inttemp  eng1.inttemp
mv normproto  eng1.normproto
mv pffmtable  eng1.pffmtable
mv shapetable  eng1.shapetable

combine_tessdata eng1. ## create .traineddata file.
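
The same steps extend to more than two pages. The following is a minimal sketch, not part of the workflow above, that builds the argument lists for unicharset_extractor, mftraining and cntraining from a page count. The names LANG_CODE, FONT, PAGES and name_list are assumptions chosen for illustration, and the script only prints the commands so the file lists can be checked before anything is run.

```shell
#!/bin/sh
# Illustrative values; they mirror the file names used in the workflow above.
LANG_CODE=eng1
FONT=MyNewFont

# Print "<lang>.<font>.exp1<suffix> ... <lang>.<font>.expN<suffix>" on one line.
# $1 = number of pages, $2 = file suffix (for example .box or .box.tr)
name_list() {
    count=$1
    suffix=$2
    list=""
    i=1
    while [ "$i" -le "$count" ]; do
        # ${list:+ } adds a separating space only when the list is non-empty.
        list="$list${list:+ }$LANG_CODE.$FONT.exp$i$suffix"
        i=$((i + 1))
    done
    echo "$list"
}

PAGES=3  # hypothetical page count

# Print the training commands instead of running them.
echo "unicharset_extractor $(name_list "$PAGES" .box)"
echo "mftraining -F font_properties -U unicharset -O $LANG_CODE.unicharset $(name_list "$PAGES" .box.tr)"
echo "cntraining $(name_list "$PAGES" .box.tr)"
```

Once the printed lines look right, they can be pasted into the shell or the echo wrappers can be dropped.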

Answer 2

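You certainly can train with multiple image files; Tesseract will treat them as having different, separate fonts. There is also a limit (64) on the number of images. If they share a common font, it is better to put them in a multi-page TIFF. According to its specification, a TIFF file can be a container holding many images.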

https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract https://en.wikipedia.org/wiki/Tagged_Image_File_Format
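
If the pages already exist as separate images, one way to build such a multi-page TIFF is ImageMagick's convert, which writes one TIFF page per input image when given several inputs. This is a minimal sketch under assumptions: the input names follow the naming used in Answer 1, and the output name eng1.MyNewFont.exp0.tif is hypothetical.

```shell
#!/bin/sh
# Hypothetical output name; the inputs follow the naming used in Answer 1.
out=eng1.MyNewFont.exp0.tif
pages="eng1.MyNewFont.exp1.png eng1.MyNewFont.exp2.png"

# Check that every page image exists before trying to merge.
missing=0
for p in $pages; do
    [ -f "$p" ] || missing=1
done

if [ "$missing" -eq 1 ]; then
    echo "some page images are missing; nothing written"
elif command -v convert >/dev/null 2>&1; then
    # Several inputs plus a .tif output yield one TIFF page per input image.
    convert $pages "$out"
else
    echo "ImageMagick convert not found"
fi
```

As far as I know, the Tesseract 3 box format carries a page index in its last column, so a single .box file can then cover every page of the merged image.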

Tesseract: advantages of a multi-page training file vs. multiple separate files?
