Question

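I trained a model for a new language with POSTagger. Unfortunately, the tagger misclassifies large numbers that are written as digits rather than as words.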

For example:

"There are 2 doctors." Here 2 is correctly tagged as NUM.
"The mayor embezzled 274556 dollars." Here 274556 is tagged as VERB or NOUN.

The English models do not seem to be affected by this. How can I make sure that all numbers written as digits are tagged as NUM?

EDIT: the latest .prop file

    ## tagger training invoked at Thu May 07 19:42:46 CEST 2015 with arguments:
                   model = models/czech.tagger
                    arch = bidirectional, naacl2003unknowns, words(0,3),words(0,4),words(0,5), unicodeshapes(-2,2), allunicodeshapes(-2,2)
            wordFunction =
               trainFile = format=TSV,corpora/train.corpus
         closedClassTags =
 closedClassTagThreshold = 40
 curWordMinFeatureThresh = 2
                   debug = false
             debugPrefix =
            tagSeparator = /
                encoding = UTF-8
              iterations = 100
                    lang =
    learnClosedClassTags = false
        minFeatureThresh = 5
           openClassTags = ADJ ADV INTJ NOUN PROPN VERB
rareWordMinFeatureThresh = 10
          rareWordThresh = 5
                  search = owlqn2
                    sgml = false
            sigmaSquared = 0.5
                   regL1 = 1.0

Results on the testFile:

Results on 10148 sentences and 174254 words, of which 12199 were unknown.
Total sentences right: 7983 (78.665747%); wrong: 2165 (21.334253%).
Total tags right: 171223 (98.260585%); wrong: 3031 (1.739415%).
Unknown words right: 11080 (90.827117%); wrong: 1119 (9.172883%).

Misclassification in the tagged output (roughly the same sentences as above):

Jsou    VERB
zde ADV
2   NUM
doktoři NOUN
.   PUNCT

Místní  ADJ
radní   NOUN
zpronevěřil VERB
2474556 NOUN
korun   NOUN
.   PUNCT
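
One possible workaround, offered only as a sketch and not as something the tagger itself provides, is to post-process the tagged output and force the NUM tag onto every token that consists solely of digits. The function name `force_num` and the regular expression are hypothetical, and the assumption that a digits-only token (optionally with `.` or `,` separators) is always a numeral may not hold for every corpus.

```python
import re

# Assumption: a token made only of digits, optionally grouped or split
# by '.' or ',' (e.g. "2474556", "1.5", "1,000"), should count as a numeral.
DIGIT_TOKEN = re.compile(r"\d+(?:[.,]\d+)*")


def force_num(tagged, num_tag="NUM"):
    """Return a copy of (token, tag) pairs with digit-only tokens retagged.

    `tagged` is a list of (token, tag) tuples, e.g. parsed from the
    tagger's token/tag output. Tags of all other tokens are left as they are.
    """
    return [
        (token, num_tag if DIGIT_TOKEN.fullmatch(token) else tag)
        for token, tag in tagged
    ]


# The misclassified sentence from the output above.
sentence = [
    ("Místní", "ADJ"),
    ("radní", "NOUN"),
    ("zpronevěřil", "VERB"),
    ("2474556", "NOUN"),
    ("korun", "NOUN"),
    (".", "PUNCT"),
]
fixed = force_num(sentence)
```

This only patches the output after tagging. It does not change the model, so neighbouring tags that were predicted in the context of the wrong NOUN or VERB tag stay as they were.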

POSTagger number classification

0 answers: