Problem training Tesseract 4.0

Time: 2018-04-25 18:53:04

Tags: tesseract training-data

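I want to train Tesseract for Arabic with a specific font. According to the training documentation, you must first create the training data before you can run the training itself.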

Training

This command is used to create the training data and the evaluation list:

$TRAINING/tesstrain.sh --fonts_dir $FOLDER/simplified-arabic --lang ara --linedata_only \
  --noextract_font_properties --langdata_dir $FOLDER \
  --tessdata_dir $FOLDER/arabox/tessdata \
  --fontlist "Simplified Arabic Bold" --output_dir $FOLDER/araeval

Output

This is the output:

=== Starting training for language 'ara'
[ر أبر 25 22:17:23 EET 2018] /usr/local/bin/text2image --fonts_dir=/home/amir-paymob/WorkSpace/learnopencv/simplified-arabic --font=Simplified Arabic Bold --outputbase=/tmp/font_tmp.sUvtJ9ehJT/sample_text.txt --text=/tmp/font_tmp.sUvtJ9ehJT/sample_text.txt --fontconfig_tmpdir=/tmp/font_tmp.sUvtJ9ehJT
Rendered page 0 to file /tmp/font_tmp.sUvtJ9ehJT/sample_text.txt.tif

=== Phase I: Generating training images ===
Rendering using Simplified Arabic Bold
[ر أبر 25 22:17:24 EET 2018] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.sUvtJ9ehJT --fonts_dir=/home/amir-paymob/WorkSpace/learnopencv/simplified-arabic --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.bBjBa5bzUW/ara/ara.Simplified_Arabic_Bold.exp0 --max_pages=3 --font=Simplified Arabic Bold --text=/home/amir-paymob/WorkSpace/learnopencv/ara/ara.training_text
Rendered page 0 to file /tmp/tmp.bBjBa5bzUW/ara/ara.Simplified_Arabic_Bold.exp0.tif

=== Phase UP: Generating unicharset and unichar properties files ===
[ر أبر 25 22:17:25 EET 2018] /usr/local/bin/unicharset_extractor --output_unicharset /tmp/tmp.bBjBa5bzUW/ara/ara.unicharset --norm_mode 2 /tmp/tmp.bBjBa5bzUW/ara/ara.Simplified_Arabic_Bold.exp0.box
Extracting unicharset from box file /tmp/tmp.bBjBa5bzUW/ara/ara.Simplified_Arabic_Bold.exp0.box
Wrote unicharset file /tmp/tmp.bBjBa5bzUW/ara/ara.unicharset
[ر أبر 25 22:17:25 EET 2018] /usr/local/bin/set_unicharset_properties -U /tmp/tmp.bBjBa5bzUW/ara/ara.unicharset -O /tmp/tmp.bBjBa5bzUW/ara/ara.unicharset -X /tmp/tmp.bBjBa5bzUW/ara/ara.xheights --script_dir=/home/amir-paymob/WorkSpace/learnopencv
Loaded unicharset of size 13 from file /tmp/tmp.bBjBa5bzUW/ara/ara.unicharset
Setting unichar properties
Setting script properties
Writing unicharset to file /tmp/tmp.bBjBa5bzUW/ara/ara.unicharset

=== Phase E: Generating lstmf files ===
Using TESSDATA_PREFIX=/home/amir-paymob/WorkSpace/learnopencv/facerec/tessdata
[ر أبر 25 22:17:25 EET 2018] /usr/local/bin/tesseract /tmp/tmp.bBjBa5bzUW/ara/ara.Simplified_Arabic_Bold.exp0.tif /tmp/tmp.bBjBa5bzUW/ara/ara.Simplified_Arabic_Bold.exp0 lstm.train /home/amir-paymob/WorkSpace/learnopencv/ara/ara.config
Tesseract Open Source OCR Engine v4.0.0-beta.1-69-g10f4 with Leptonica
Page 1

=== Constructing LSTM training data ===
[ر أبر 25 22:17:25 EET 2018] /usr/local/bin/combine_lang_model --input_unicharset /tmp/tmp.bBjBa5bzUW/ara/ara.unicharset --script_dir /home/amir-paymob/WorkSpace/learnopencv --words /home/amir-paymob/WorkSpace/learnopencv/ara/ara.wordlist --numbers /home/amir-paymob/WorkSpace/learnopencv/ara/ara.numbers --puncs /home/amir-paymob/WorkSpace/learnopencv/ara/ara.punc --output_dir /home/amir-paymob/WorkSpace/learnopencv/araeval --lang ara --pass_through_recoder --lang_is_rtl
Loaded unicharset of size 13 from file /tmp/tmp.bBjBa5bzUW/ara/ara.unicharset
Setting unichar properties
Setting script properties
Config file is optional, continuing...
Reducing Trie to SquishedDawg
Error during conversion of wordlists to DAWGs!!
Moving /tmp/tmp.bBjBa5bzUW/ara/ara.Simplified_Arabic_Bold.exp0.lstmf to /home/amir-paymob/WorkSpace/learnopencv/araeval

Completed training for language 'ara'
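
Note that the script still prints "Completed training" even though the DAWG conversion step reported an error, so the failure is easy to miss in a long log. The log also reports a unicharset of size 13, which may be worth comparing against the characters in the training text and word list. A minimal sketch for pulling both pieces of information out of a saved log follows; the function name scan_tesstrain_log and the file name tesstrain.log are hypothetical, and it assumes the output of tesstrain.sh was redirected to a file.

```shell
#!/bin/sh
# Hypothetical helper (not part of Tesseract): summarize a saved tesstrain.sh log.
# Usage: scan_tesstrain_log tesstrain.log
scan_tesstrain_log() {
    log="$1"
    # Numbered lines that begin with "Error", such as the DAWG message above.
    grep -n -i '^Error' "$log"
    # Unicharset sizes reported by the tools, de-duplicated.
    grep -o 'Loaded unicharset of size [0-9]*' "$log" | sort -u
}
```

Called as scan_tesstrain_log tesstrain.log, it prints the matching error lines with their line numbers, followed by each distinct unicharset size that appears in the log.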

Problem

Error during conversion of wordlists to DAWGs!!
  

I don't understand what this error means.

The training itself, using the lstmf files:

$TRAINING/lstmtraining --debug_interval 0 --max_iterations 1 \
--traineddata $FOLDER/arabox/tessdata/ara.traineddata \
  --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
  --model_output $OUTPUT/base --learning_rate 20e-4 \
  --train_listfile $FOLDER/araeval/ara.training_files.txt \
  --max_iterations 5000 &>$OUTPUT/basetrain.log

It then goes into an endless loop of encoding errors: can't encode the character
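
One possible cause of such encoding errors, which the log alone does not confirm, is that the training text contains characters that have no entry in the unicharset of the traineddata passed to lstmtraining. A rough way to check this is sketched below. It assumes a unicharset file is available as plain text (for example one unpacked with combine_tessdata -u), that its first line is the entry count and each following line starts with the unichar, and that the shell runs in a UTF-8 locale. The function name missing_unichars is hypothetical, and the check looks at single characters only, so multi-character unichars are not handled.

```shell
#!/bin/sh
# Hypothetical helper (not part of Tesseract): list characters in a text
# file that have no entry in a unicharset file. Assumes a UTF-8 locale so
# that grep treats each character, not each byte, as one match.
# Usage: missing_unichars ara.training_text ara.unicharset
missing_unichars() {
    text="$1"
    unicharset="$2"
    chars=$(mktemp)
    known=$(mktemp)
    # One non-space character per line, de-duplicated.
    grep -o '[^[:space:]]' "$text" | LC_ALL=C sort -u > "$chars"
    # Skip the first line (entry count); the first field of each
    # remaining line is the unichar itself.
    tail -n +2 "$unicharset" | cut -d' ' -f1 | LC_ALL=C sort -u > "$known"
    # Lines present only in the first file: characters the unicharset lacks.
    LC_ALL=C comm -23 "$chars" "$known"
    rm -f "$chars" "$known"
}
```

If the function prints nothing, every non-space character in the text has a matching single-character entry; any character it does print is a candidate for the "can't encode" messages.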

Now

According to the documentation:

  

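The bold elements must be provided. The others are optional, but if any of the dawgs are provided, the punctuation dawg must also be provided. A new tool, combine_lang_data, is provided to make a starter traineddata from a unicharset and optional word lists.

But what is combine_lang_data?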

0 answers:

No answers