OpenNLP Custom POS Tagger: How to make the Dictionary override tags from the training input

Time: 2017-03-22 18:15:21

Tags: dictionary nlp opennlp pos-tagger

I am using OpenNLP to train my own POS tagger, as shown below:

public Trainer(String trainingData, String modelSavePath, String dictionary){

        try {
            dataIn = new MarkableFileInputStreamFactory(
                    new File(classLoader.getResource(trainingData).getFile()));

            lineStream = new PlainTextByLineStream(dataIn, "UTF-8");
            ObjectStream<POSSample> sampleStream = new WordTagSampleStream(lineStream);

            POSTaggerFactory fac=new POSTaggerFactory();

            if (dictionary != null && dictionary.length() > 0) {
                fac.setDictionary(new Dictionary(new FileInputStream(classLoader.getResource(dictionary).getFile())));
            }
            model = POSTaggerME.train("en", sampleStream, TrainingParameters.defaultParams(), fac);

        } catch (IOException e) {
            // Failed to read or parse training data, training failed
            e.printStackTrace();
        } finally {
            if (lineStream != null) {
                try {
                    lineStream.close();
                } catch (IOException e) {
                    // Not an issue, training already finished.
                    // The exception should be logged and investigated
                    // if part of a production system.
                    e.printStackTrace();
                }
            }
        }

        OutputStream modelOut = null;
        try {
            modelOut = new BufferedOutputStream(new FileOutputStream(modelSavePath));
            //modelOut = new BufferedOutputStream(new FileOutputStream(new File(getClass().getResource(modelSavePath).toURI())));
            model.serialize(modelOut);
        } catch (IOException e) {
            // Failed to save model
            e.printStackTrace();
        } finally {
            if (modelOut != null) {
                try {
                    modelOut.close();
                } catch (IOException e) {
                    // Failed to correctly save model.
                    // Written model might be invalid.
                    e.printStackTrace();
                }
            }

        }
    }

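This works fine and saves the newly trained model as a bin file. I want the dictionary entries to override the tags of the words in the training input, but I am not seeing that behavior.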

Consider this training input:

Mary_NNP had_VBD a_DT little_JJ lamb_NN

Now I want the tag to be

lamb_LAMB

so I put it in the dictionary:

<?xml version="1.0" encoding="UTF-8"?>
<dictionary>
<entry tags="LAMB">
<token>lamb</token>
</entry>
</dictionary>
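
To check what a dictionary file like the one above actually declares, independently of OpenNLP, its entries can be read with the JDK's XML parser. This is only a minimal sketch: `TagDictionaryReader` and `read` are hypothetical names, not OpenNLP classes, and it assumes that several tags in one `tags` attribute would be separated by whitespace.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper (not part of OpenNLP): reads
// <entry tags="..."><token>...</token></entry> elements into a map.
class TagDictionaryReader {

    /** Maps each token to the tags listed in its entry's "tags" attribute. */
    static Map<String, String[]> read(InputStream in) {
        try {
            NodeList entries = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(in)
                    .getElementsByTagName("entry");
            Map<String, String[]> tagsByToken = new LinkedHashMap<>();
            for (int i = 0; i < entries.getLength(); i++) {
                Element entry = (Element) entries.item(i);
                NodeList tokens = entry.getElementsByTagName("token");
                String tagList = entry.getAttribute("tags").trim();
                if (tokens.getLength() == 0 || tagList.isEmpty()) {
                    continue; // skip entries without a token or without tags
                }
                String token = tokens.item(0).getTextContent().trim();
                // Assumption: multiple tags are whitespace-separated.
                tagsByToken.put(token, tagList.split("\\s+"));
            }
            return tagsByToken;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse dictionary XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<dictionary><entry tags=\"LAMB\"><token>lamb</token></entry></dictionary>";
        Map<String, String[]> dict =
                read(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        System.out.println(String.join(",", dict.get("lamb"))); // prints "LAMB"
    }
}
```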

But when I try the newly trained tagger, lamb is still tagged as NN.

However, if my training data is

Mary_NNP had_VBD a_DT little_JJ lamb_LAMB

then it works as expected. Also, if the word lamb does not appear in my training data at all, the custom tagger uses the dictionary tag.
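
Since the tagger behaves as expected once the training line itself carries lamb_LAMB, one possible workaround is to rewrite the training data with the dictionary tags before training. The sketch below is plain Java and does not call OpenNLP. `TrainingDataRetagger` and `retag` are hypothetical names, and it assumes each token has the word_TAG form shown above, split at the last underscore.

```java
import java.util.Collections;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical pre-processing helper (not part of OpenNLP).
class TrainingDataRetagger {

    /**
     * Rewrites one line of word_TAG tokens, replacing the tag of every
     * word that has an entry in overrides. Other tokens are kept as-is.
     */
    static String retag(String line, Map<String, String> overrides) {
        StringJoiner out = new StringJoiner(" ");
        for (String pair : line.trim().split("\\s+")) {
            int sep = pair.lastIndexOf('_');
            if (sep <= 0) {
                out.add(pair); // not in word_TAG form, leave untouched
                continue;
            }
            String word = pair.substring(0, sep);
            String originalTag = pair.substring(sep + 1);
            out.add(word + "_" + overrides.getOrDefault(word, originalTag));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> overrides = Collections.singletonMap("lamb", "LAMB");
        System.out.println(retag("Mary_NNP had_VBD a_DT little_JJ lamb_NN", overrides));
        // prints "Mary_NNP had_VBD a_DT little_JJ lamb_LAMB"
    }
}
```

The lookup here is case-sensitive, so "Lamb" at the start of a sentence would keep its original tag unless the map also contains that spelling.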

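How can I make sure that the dictionary tag always overrides the tag from the training data? Do I have to modify the training in some way?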
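
For illustration only, the override being asked for can also be enforced outside the model, by replacing tags after tagging. This is a sketch of a post-processing step, not OpenNLP's own dictionary mechanism. `DictionaryOverride` and `apply` are hypothetical names, and the two arrays are assumed to be the token array passed to the tagger and the tag array it returned.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;

// Hypothetical post-processing helper (not part of OpenNLP).
class DictionaryOverride {

    /**
     * Returns a copy of predictedTags in which every token that has an
     * entry in overrides receives the dictionary tag instead.
     */
    static String[] apply(String[] tokens, String[] predictedTags,
                          Map<String, String> overrides) {
        if (tokens.length != predictedTags.length) {
            throw new IllegalArgumentException("tokens and tags differ in length");
        }
        String[] result = predictedTags.clone();
        for (int i = 0; i < tokens.length; i++) {
            String forced = overrides.get(tokens[i]);
            if (forced != null) {
                result[i] = forced; // dictionary tag wins over the model's tag
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] tokens = {"Mary", "had", "a", "little", "lamb"};
        String[] predicted = {"NNP", "VBD", "DT", "JJ", "NN"};
        String[] fixed = apply(tokens, predicted,
                Collections.singletonMap("lamb", "LAMB"));
        System.out.println(Arrays.toString(fixed)); // prints "[NNP, VBD, DT, JJ, LAMB]"
    }
}
```

A step like this leaves the trained model untouched, so the model's predictions for the neighbouring words are still based on the tag it chose itself, not on the overridden one.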

0 answers:

No answers yet