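I trained a model for a new language with POSTagger. Unfortunately, the tagger misclassifies large numbers written in digits (not spelled out as words).
For example:
"There are 2 doctors." Here 2 is correctly classified as NUM.
"The mayor embezzled 274556 dollars." Here 274556 is classified as VERB or NOUN.
The English models do not seem to be affected by this. How can I make sure that every number written in digits (not spelled out as a word) is classified as NUM?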
Edit: the latest .prop file:
## tagger training invoked at Thu May 07 19:42:46 CEST 2015 with arguments:
model = models/czech.tagger
arch = bidirectional, naacl2003unknowns, words(0,3),words(0,4),words(0,5), unicodeshapes(-2,2), allunicodeshapes(-2,2)
wordFunction =
trainFile = format=TSV,corpora/train.corpus
closedClassTags =
closedClassTagThreshold = 40
curWordMinFeatureThresh = 2
debug = false
debugPrefix =
tagSeparator = /
encoding = UTF-8
iterations = 100
lang =
learnClosedClassTags = false
minFeatureThresh = 5
openClassTags = ADJ ADV INTJ NOUN PROPN VERB
rareWordMinFeatureThresh = 10
rareWordThresh = 5
search = owlqn2
sgml = false
sigmaSquared = 0.5
regL1 = 1.0
Results on testFile:
Results on 10148 sentences and 174254 words, of which 12199 were unknown.
Total sentences right: 7983 (78.665747%); wrong: 2165 (21.334253%).
Total tags right: 171223 (98.260585%); wrong: 3031 (1.739415%).
Unknown words right: 11080 (90.827117%); wrong: 1119 (9.172883%).
The misclassification in the tagged output (roughly the same sentences as above):
Jsou VERB
zde ADV
2 NUM
doktoři NOUN
. PUNCT
Místní ADJ
radní NOUN
zpronevěřil VERB
2474556 NOUN
korun NOUN
. PUNCT
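One workaround I can think of, independent of retraining, is a post-processing pass that overrides the tag of any token written in digits. Below is a minimal sketch in Python, assuming the tagger output has already been parsed into (word, tag) pairs. The names `NUMERIC_TOKEN` and `force_num_tags` are hypothetical helpers, not part of the tagger, and the definition of a "number" (digits with optional `.`, `,` or space separators) is my own assumption.

```python
import re

# Assumption: a numeric token is one or more digit groups, optionally joined
# by a single '.', ',', space, or no-break space (e.g. 274556, 1,000, 3.14).
NUMERIC_TOKEN = re.compile(r"\d+(?:[.,\u00a0 ]\d+)*")


def force_num_tags(tagged_tokens, num_tag="NUM"):
    """Return a new list of (word, tag) pairs with numeric tokens retagged."""
    return [
        (word, num_tag if NUMERIC_TOKEN.fullmatch(word) else tag)
        for word, tag in tagged_tokens
    ]


# The second sentence from the tagger output above.
sentence = [
    ("Místní", "ADJ"),
    ("radní", "NOUN"),
    ("zpronevěřil", "VERB"),
    ("2474556", "NOUN"),
    ("korun", "NOUN"),
    (".", "PUNCT"),
]

fixed = force_num_tags(sentence)
print(fixed[3])  # ('2474556', 'NUM')
```

This only patches the output. It does not explain why the model itself prefers NOUN or VERB for long digit strings, so I would still like to know how to fix this at training time.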