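I am trying to train Tesseract for Arabic with a specific font.
According to {{3}}, the training data has to be created before anything else.
This command creates the training data and the evaluation list: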
$TRAINING/tesstrain.sh --fonts_dir $FOLDER/simplified-arabic --lang ara --linedata_only \
--noextract_font_properties --langdata_dir $FOLDER \
--tessdata_dir $FOLDER/arabox/tessdata \
--fontlist "Simplified Arabic Bold" --output_dir $FOLDER/araeval
This is the output:
=== Starting training for language 'ara'
[ر أبر 25 22:17:23 EET 2018] /usr/local/bin/text2image --fonts_dir=/home/amir-paymob/WorkSpace/learnopencv/simplified-arabic --font=Simplified Arabic Bold --outputbase=/tmp/font_tmp.sUvtJ9ehJT/sample_text.txt --text=/tmp/font_tmp.sUvtJ9ehJT/sample_text.txt --fontconfig_tmpdir=/tmp/font_tmp.sUvtJ9ehJT
Rendered page 0 to file /tmp/font_tmp.sUvtJ9ehJT/sample_text.txt.tif
=== Phase I: Generating training images ===
Rendering using Simplified Arabic Bold
[ر أبر 25 22:17:24 EET 2018] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.sUvtJ9ehJT --fonts_dir=/home/amir-paymob/WorkSpace/learnopencv/simplified-arabic --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.bBjBa5bzUW/ara/ara.Simplified_Arabic_Bold.exp0 --max_pages=3 --font=Simplified Arabic Bold --text=/home/amir-paymob/WorkSpace/learnopencv/ara/ara.training_text
Rendered page 0 to file /tmp/tmp.bBjBa5bzUW/ara/ara.Simplified_Arabic_Bold.exp0.tif
=== Phase UP: Generating unicharset and unichar properties files ===
[ر أبر 25 22:17:25 EET 2018] /usr/local/bin/unicharset_extractor --output_unicharset /tmp/tmp.bBjBa5bzUW/ara/ara.unicharset --norm_mode 2 /tmp/tmp.bBjBa5bzUW/ara/ara.Simplified_Arabic_Bold.exp0.box
Extracting unicharset from box file /tmp/tmp.bBjBa5bzUW/ara/ara.Simplified_Arabic_Bold.exp0.box
Wrote unicharset file /tmp/tmp.bBjBa5bzUW/ara/ara.unicharset
[ر أبر 25 22:17:25 EET 2018] /usr/local/bin/set_unicharset_properties -U /tmp/tmp.bBjBa5bzUW/ara/ara.unicharset -O /tmp/tmp.bBjBa5bzUW/ara/ara.unicharset -X /tmp/tmp.bBjBa5bzUW/ara/ara.xheights --script_dir=/home/amir-paymob/WorkSpace/learnopencv
Loaded unicharset of size 13 from file /tmp/tmp.bBjBa5bzUW/ara/ara.unicharset
Setting unichar properties
Setting script properties
Writing unicharset to file /tmp/tmp.bBjBa5bzUW/ara/ara.unicharset
=== Phase E: Generating lstmf files ===
Using TESSDATA_PREFIX=/home/amir-paymob/WorkSpace/learnopencv/facerec/tessdata
[ر أبر 25 22:17:25 EET 2018] /usr/local/bin/tesseract /tmp/tmp.bBjBa5bzUW/ara/ara.Simplified_Arabic_Bold.exp0.tif /tmp/tmp.bBjBa5bzUW/ara/ara.Simplified_Arabic_Bold.exp0 lstm.train /home/amir-paymob/WorkSpace/learnopencv/ara/ara.config
Tesseract Open Source OCR Engine v4.0.0-beta.1-69-g10f4 with Leptonica
Page 1
=== Constructing LSTM training data ===
[ر أبر 25 22:17:25 EET 2018] /usr/local/bin/combine_lang_model --input_unicharset /tmp/tmp.bBjBa5bzUW/ara/ara.unicharset --script_dir /home/amir-paymob/WorkSpace/learnopencv --words /home/amir-paymob/WorkSpace/learnopencv/ara/ara.wordlist --numbers /home/amir-paymob/WorkSpace/learnopencv/ara/ara.numbers --puncs /home/amir-paymob/WorkSpace/learnopencv/ara/ara.punc --output_dir /home/amir-paymob/WorkSpace/learnopencv/araeval --lang ara --pass_through_recoder --lang_is_rtl
Loaded unicharset of size 13 from file /tmp/tmp.bBjBa5bzUW/ara/ara.unicharset
Setting unichar properties
Setting script properties
Config file is optional, continuing...
Reducing Trie to SquishedDawg
Error during conversion of wordlists to DAWGs!!
Moving /tmp/tmp.bBjBa5bzUW/ara/ara.Simplified_Arabic_Bold.exp0.lstmf to /home/amir-paymob/WorkSpace/learnopencv/araeval
Completed training for language 'ara'
Error during conversion of wordlists to DAWGs!!
I don't understand what that error means.
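The combine_lang_model call in the log is given three list files (--words, --numbers, --puncs). As a first sanity check, not a diagnosis, a small sketch like the one below could confirm that those files exist and are not empty. The function name check_lang_lists is hypothetical, and the idea that a missing or empty list is related to the DAWG error is only an assumption.

```python
from pathlib import Path


def check_lang_lists(lang_dir, lang="ara", suffixes=("wordlist", "numbers", "punc")):
    """Return {filename: count of non-empty lines, or None if the file is missing}.

    Hypothetical helper: the file names follow the pattern seen in the
    combine_lang_model command line in the log (ara.wordlist, ara.numbers,
    ara.punc).
    """
    report = {}
    for suffix in suffixes:
        path = Path(lang_dir) / f"{lang}.{suffix}"
        if not path.is_file():
            report[path.name] = None  # file is missing
            continue
        with path.open(encoding="utf-8") as handle:
            # Count only lines that contain something other than whitespace.
            report[path.name] = sum(1 for line in handle if line.strip())
    return report
```

Calling check_lang_lists on the ara directory from the log would show at a glance whether any of the three lists is absent (None) or empty (0).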
Then the training itself, using the lstmf file:
$TRAINING/lstmtraining --debug_interval 0 --max_iterations 1 \
--traineddata $FOLDER/arabox/tessdata/ara.traineddata \
--net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
--model_output $OUTPUT/base --learning_rate 20e-4 \
--train_listfile $FOLDER/araeval/ara.training_files.txt \
--max_iterations 5000 &>$OUTPUT/basetrain.log
It goes into an endless loop of encoding errors: can't encode the character
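The log above reports a unicharset of only 13 entries, so one thing worth checking is whether the characters in the training text are in the unicharset at all. The following is a minimal sketch of such a check, assuming the unicharset is a UTF-8 text file whose first line is the entry count and whose remaining lines each start with the unichar followed by a space. The function names are hypothetical. It only compares single code points, so it may over-report when the unicharset holds multi-character entries.

```python
from pathlib import Path


def load_unicharset(path):
    """Return the set of unichars listed in a unicharset file.

    Assumption: line 1 holds the entry count, and every later line begins
    with the unichar, then a space, then its properties.
    """
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    chars = set()
    for line in lines[1:]:  # skip the count on the first line
        if line.strip():
            chars.add(line.split(" ", 1)[0])
    return chars


def missing_chars(text, unicharset):
    """Return the sorted non-whitespace characters of text not in unicharset."""
    return sorted(
        {ch for ch in text if not ch.isspace() and ch not in unicharset}
    )


def report_missing(unicharset_path, text_path):
    """Hypothetical convenience wrapper: compare a text file with a unicharset."""
    known = load_unicharset(unicharset_path)
    text = Path(text_path).read_text(encoding="utf-8")
    return missing_chars(text, known)
```

Running report_missing on the unicharset and on ara.training_text would list the characters that have no entry of their own. An empty list would suggest the problem lies elsewhere.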
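According to the documentation:
"The bold elements must be provided. Others are optional, but if any of the dawgs are provided, the punctuation dawg must also be provided. A new tool, combine_lang_data, is provided to make a starter traineddata from a unicharset and optional wordlists."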
But combine_lang_data