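I have built a CRF model. My dataset has 24 classes, and because I am just getting started, my training data is a corpus of only 1,200 tokens. I trained the model on it. In the training data I used plural tokens such as addresses, photos, states, countries, etc.
At test time, if I give the model a sentence containing the plural tokens, it works well. But if I give it a sentence with the singular forms, such as photo or state, it does not assign any label to them.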
This behavior of the CRF seems strange. I have explored NERFeatureFactory and tried some of its lemma features, but that did not help either. Below is the austen.prop file I used to build the model.
# location of the training file
trainFile = training_data_for_ner.txt
# location where you would like to save (serialize) your
# classifier; adding .gz at the end automatically gzips the file,
# making it smaller, and faster to load
serializeTo = ner-model.ser.gz
# structure of your training file; this tells the classifier that
# the word is in column 0, the correct answer is in column 1,
# the POS tag is in column 2 and the lemma is in column 3
map = word=0,answer=1,pos=2,lemma=3
# This specifies the order of the CRF: order 1 means that features
# apply at most to a class pair of previous class and current class
# or current class and next class.
maxLeft=1
# these are the features we'd like to train with
# some are discussed below, the rest can be
# understood by looking at NERFeatureFactory
useClassFeature=true
useWord=true
# word character ngrams will be included up to length 6 as prefixes
# and suffixes only
useNGrams=true
noMidNGrams=true
maxNGramLeng=6
usePrev=true
useNext=true
useDisjunctive=true
useSequences=true
usePrevSequences=true
# the last 4 properties deal with word shape features
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
# newly added features.
useLemmas=true
usePrevNextLemmas=true
useLemmaAsWord=true
useTags=true
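The last four features (useLemmas, usePrevNextLemmas, useLemmaAsWord, useTags) are the ones I newly added. I would be grateful if anyone could help me solve this problem.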
Answer 0 (score: 0)
You should retrain using stemmed tokens. For an example, see the main method of https://github.com/stanfordnlp/CoreNLP/blob/master/src/edu/stanford/nlp/process/Stemmer.java.
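As a rough illustration of the idea, the sketch below rewrites the word column of a tab-separated training file so that singular and plural forms collapse to the same token before training. It is only a sketch under assumptions: `naive_stem` and `stem_word_column` are hypothetical helper names, the suffix rules are a toy stand-in for a real stemmer such as the Porter stemmer in Stemmer.java, and the `PHOTO` label in the usage example is made up.

```python
def naive_stem(word: str) -> str:
    """Toy plural stripper (assumption: NOT the Porter algorithm)."""
    lower = word.lower()
    if len(lower) > 4 and lower.endswith("ies"):
        return word[:-3] + "y"   # countries -> country
    if len(lower) > 4 and lower.endswith("sses"):
        return word[:-2]         # addresses -> address
    if len(lower) > 3 and lower.endswith("s") and not lower.endswith("ss"):
        return word[:-1]         # photos -> photo, states -> state
    return word


def stem_word_column(lines, word_col=0):
    """Stem one column of tab-separated training lines.

    Blank lines are kept unchanged, since they usually separate
    sentences or documents in column-format training data.
    """
    out = []
    for line in lines:
        if not line.strip():
            out.append(line)
            continue
        cols = line.split("\t")
        cols[word_col] = naive_stem(cols[word_col])
        out.append("\t".join(cols))
    return out


if __name__ == "__main__":
    # Columns follow map = word=0,answer=1,pos=2,lemma=3
    sample = ["photos\tPHOTO\tNNS\tphoto", ""]
    for row in stem_word_column(sample):
        print(row)
```

The same normalization would have to be applied to the input sentences at test time; otherwise the model would again see surface forms it was never trained on.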