I am using OpenNLP to train my own POS tagger, as shown below:
    public Trainer(String trainingData, String modelSavePath, String dictionary) {
        try {
            dataIn = new MarkableFileInputStreamFactory(
                    new File(classLoader.getResource(trainingData).getFile()));
            lineStream = new PlainTextByLineStream(dataIn, "UTF-8");
            ObjectStream<POSSample> sampleStream = new WordTagSampleStream(lineStream);
            POSTaggerFactory fac = new POSTaggerFactory();
            if (dictionary != null && dictionary.length() > 0) {
                fac.setDictionary(new Dictionary(new FileInputStream(classLoader.getResource(dictionary).getFile())));
            }
            model = POSTaggerME.train("en", sampleStream, TrainingParameters.defaultParams(), fac);
        } catch (IOException e) {
            // Failed to read or parse training data, training failed
            e.printStackTrace();
        } finally {
            if (lineStream != null) {
                try {
                    lineStream.close();
                } catch (IOException e) {
                    // Not an issue, training already finished.
                    // The exception should be logged and investigated
                    // if part of a production system.
                    e.printStackTrace();
                }
            }
        }
        OutputStream modelOut = null;
        try {
            modelOut = new BufferedOutputStream(new FileOutputStream(modelSavePath));
            //modelOut = new BufferedOutputStream(new FileOutputStream(new File(getClass().getResource(modelSavePath).toURI())));
            model.serialize(modelOut);
        } catch (IOException e) {
            // Failed to save model
            e.printStackTrace();
        } finally {
            if (modelOut != null) {
                try {
                    modelOut.close();
                } catch (IOException e) {
                    // Failed to correctly save model.
                    // Written model might be invalid.
                    e.printStackTrace();
                }
            }
        }
    }
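This works fine and saves the newly trained model as a .bin file. However, I want the dictionary entries to override the tags learned for words in the training input, and I am not seeing that behaviour.
So consider this training input: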
Mary_NNP had_VBD a_DT little_JJ lamb_NN
Now I want the tag to be
lamb_LAMB
so I put it in the dictionary:
    <?xml version="1.0" encoding="UTF-8"?>
    <dictionary>
        <entry tags="LAMB">
            <token>lamb</token>
        </entry>
    </dictionary>
But when I try the newly trained tagger, lamb is still tagged as NN.
However, if my training data is
Mary_NNP had_VBD a_DT little_JJ lamb_LAMB
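then it works as expected. Also, if the word lamb does not appear in my training data at all, the custom-trained tagger uses the dictionary tag.
How can I make sure the dictionary tag always overrides the tag from the training data? Do I have to modify the training in some way?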
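
To be concrete about what I mean by "override", here is a minimal sketch in plain Java. It does not use any OpenNLP API: the class `TagOverride`, the method `applyOverrides`, and the in-memory map standing in for the dictionary file are all hypothetical names made up for illustration. It only shows the outcome I am after, where any token listed in the dictionary ends up with the dictionary tag regardless of what the model predicted.

```java
import java.util.HashMap;
import java.util.Map;

class TagOverride {

    // Hypothetical helper (not part of OpenNLP): returns a copy of modelTags
    // in which every token present in `overrides` gets the dictionary tag.
    static String[] applyOverrides(String[] tokens, String[] modelTags,
                                   Map<String, String> overrides) {
        String[] result = modelTags.clone();
        for (int i = 0; i < tokens.length; i++) {
            String forced = overrides.get(tokens[i]);
            if (forced != null) {
                result[i] = forced;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] tokens = {"Mary", "had", "a", "little", "lamb"};
        // Tags as the trained model currently produces them (assumed here).
        String[] modelTags = {"NNP", "VBD", "DT", "JJ", "NN"};

        // Stand-in for the XML dictionary shown above.
        Map<String, String> overrides = new HashMap<>();
        overrides.put("lamb", "LAMB");

        String[] finalTags = applyOverrides(tokens, modelTags, overrides);

        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) sb.append(' ');
            sb.append(tokens[i]).append('_').append(finalTags[i]);
        }
        System.out.println(sb); // Mary_NNP had_VBD a_DT little_JJ lamb_LAMB
    }
}
```

Doing this as a post-processing step after tagging would be a workaround; what I would prefer is for the trained tagger itself to behave this way.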