Question

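I built a CRF model. My dataset has 24 classes, and since I am just getting started, my training corpus contains only 1,200 tokens. I trained the model. In the training data I used plural tokens such as addresses, photos, states, countries, etc.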

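Now, at test time, if I give the model a sentence containing the plural tokens, it works fine. But if I enter the sentence with singular forms such as photo, state, etc., it does not assign any label to them.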

This behavior of the CRF seems strange. I have explored NERFeatureFactory and tried some of the lemma features, but that did not help either. Here is the properties file I built for the model.

austen.prop

# location of the training file
trainFile = training_data_for_ner.txt
# location where you would like to save (serialize) your
# classifier; adding .gz at the end automatically gzips the file,
# making it smaller, and faster to load
serializeTo = ner-model.ser.gz
# structure of your training file; this tells the classifier that
# the word is in column 0 and the correct answer is in column 1
map = word=0,answer=1,pos=2,lemma=3
# This specifies the order of the CRF: order 1 means that features
# apply at most to a class pair of previous class and current class
# or current class and next class.
maxLeft=1
# these are the features we'd like to train with
# some are discussed below, the rest can be
# understood by looking at NERFeatureFactory
useClassFeature=true
useWord=true
# word character ngrams will be included up to length 6 as prefixes
# and suffixes only
useNGrams=true
noMidNGrams=true
maxNGramLeng=6
usePrev=true
useNext=true
useDisjunctive=true
useSequences=true
usePrevSequences=true
# the last 4 properties deal with word shape features
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
# newly added features.
useLemmas=true
usePrevNextLemmas=true
useLemmaAsWord=true
useTags=true

The last four features are the ones I added. I would be grateful if anyone could help me solve this problem.

Answer 1

You should retrain using stemmed tokens. For an example, see https://github.com/stanfordnlp/CoreNLP/blob/master/src/edu/stanford/nlp/process/Stemmer.java (the main method).
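
To illustrate the idea, here is a minimal sketch of normalizing the word column of the training file so that plural and singular forms collapse to the same token. It is not the Stemmer class linked above: normalize is a hypothetical, deliberately naive stand-in that only strips a few English plural suffixes, and the class name TokenNormalizer, the tab-separated column layout, and the PHOTO label are all assumptions made for the example. In practice a real stemmer would take the place of normalize.

```java
class TokenNormalizer {

    // Hypothetical stand-in for a real stemmer: lowercases the token and
    // strips a few common English plural suffixes. Not a Porter stemmer.
    static String normalize(String token) {
        String t = token.toLowerCase();
        if (t.length() > 4 && t.endsWith("ies")) {
            return t.substring(0, t.length() - 3) + "y";   // countries -> country
        }
        if (t.length() > 4 && t.endsWith("sses")) {
            return t.substring(0, t.length() - 2);         // addresses -> address
        }
        if (t.length() > 3 && t.endsWith("s") && !t.endsWith("ss")) {
            return t.substring(0, t.length() - 1);         // photos -> photo
        }
        return t;                                          // already singular
    }

    // Rewrites one training line, assuming tab-separated columns with the
    // word in column 0 (as in map = word=0,answer=1,pos=2,lemma=3).
    // Blank lines are returned unchanged.
    static String rewriteLine(String line) {
        if (line.trim().isEmpty()) {
            return line;
        }
        String[] cols = line.split("\t", -1);
        cols[0] = normalize(cols[0]);
        return String.join("\t", cols);
    }

    public static void main(String[] args) {
        // Plural and singular inputs end up as the same token.
        System.out.println(rewriteLine("Photos\tPHOTO\tNNS\tphoto"));
        System.out.println(rewriteLine("photo\tPHOTO\tNN\tphoto"));
    }
}
```

Whatever normalization is used, the same function has to be applied to the input sentences at prediction time, otherwise the model still sees word forms it was never trained on.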

CRF model trained on plurals, not singulars
