Question

I am creating my own POS tagger with OpenNLP, as shown below:

public Trainer(String trainingData, String modelSavePath, String dictionary){

        try {
            dataIn = new MarkableFileInputStreamFactory(
                    new File(classLoader.getResource(trainingData).getFile()));

            lineStream = new PlainTextByLineStream(dataIn, "UTF-8");
            ObjectStream<POSSample> sampleStream = new WordTagSampleStream(lineStream);

            POSTaggerFactory fac=new POSTaggerFactory();

            if (dictionary != null && dictionary.length() > 0) {
                fac.setDictionary(new Dictionary(new FileInputStream(classLoader.getResource(dictionary).getFile())));
            }
            model = POSTaggerME.train("en", sampleStream, TrainingParameters.defaultParams(), fac);

        } catch (IOException e) {
            // Failed to read or parse training data, training failed
            e.printStackTrace();
        } finally {
            if (lineStream != null) {
                try {
                    lineStream.close();
                } catch (IOException e) {
                    // Not an issue, training already finished.
                    // The exception should be logged and investigated
                    // if part of a production system.
                    e.printStackTrace();
                }
            }
        }

        OutputStream modelOut = null;
        try {
            modelOut = new BufferedOutputStream(new FileOutputStream(modelSavePath));
            //modelOut = new BufferedOutputStream(new FileOutputStream(new File(getClass().getResource(modelSavePath).toURI())));
            model.serialize(modelOut);
        } catch (IOException e) {
            // Failed to save model
            e.printStackTrace();
        } finally {
            if (modelOut != null) {
                try {
                    modelOut.close();
                } catch (IOException e) {
                    // Failed to correctly save model.
                    // Written model might be invalid.
                    e.printStackTrace();
                }
            }

        }
    }

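This works fine and saves the newly created model as a .bin file. I expected the dictionary entries to override the tags of words in the input, but I am not seeing that behavior.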

So consider the input

Mary_NNP had_VBD a_DT little_JJ lamb_NN

Now I want the tag

lamb_LAMB

so I put it in the dictionary:

<?xml version="1.0" encoding="UTF-8"?>
<dictionary>
<entry tags="LAMB">
<token>lamb</token>
</entry>
</dictionary>

But when I try the newly trained tagger, I still see lamb tagged as NN.

However, if my training data is

Mary_NNP had_VBD a_DT little_JJ lamb_LAMB

then it works as expected. Also, if the word lamb does not appear in my training data at all, the custom-built tagger uses the dictionary tag.

How can I make sure the dictionary tag always overrides the tag from the training data? Do I have to modify the training in some way?
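
To make the desired behavior concrete, here is a minimal sketch of one possible workaround: leave training untouched and overwrite the predicted tag after tagging whenever the token has a dictionary entry. This is an assumption for illustration, not a confirmed OpenNLP mechanism. The class `DictionaryOverride`, the method `applyOverrides`, and the plain `Map` standing in for the XML dictionary are all hypothetical, and the `predicted` array stands in for the output of the trained tagger.

```java
import java.util.HashMap;
import java.util.Map;

public class DictionaryOverride {

    // Hypothetical helper: returns a copy of the predicted tags in which every
    // token that has an entry in the override map gets the dictionary tag.
    // Matching is exact (case-sensitive), which is an assumption.
    public static String[] applyOverrides(String[] tokens, String[] tags,
                                          Map<String, String> overrides) {
        if (tokens.length != tags.length) {
            throw new IllegalArgumentException("tokens and tags must have the same length");
        }
        String[] result = tags.clone();
        for (int i = 0; i < tokens.length; i++) {
            String forced = overrides.get(tokens[i]);
            if (forced != null) {
                result[i] = forced;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for the dictionary entry <entry tags="LAMB"><token>lamb</token></entry>
        Map<String, String> overrides = new HashMap<>();
        overrides.put("lamb", "LAMB");

        String[] tokens = {"Mary", "had", "a", "little", "lamb"};
        // Stand-in for the tags the trained model currently produces
        String[] predicted = {"NNP", "VBD", "DT", "JJ", "NN"};

        String[] finalTags = applyOverrides(tokens, predicted, overrides);
        System.out.println(String.join(" ", finalTags)); // prints "NNP VBD DT JJ LAMB"
    }
}
```

With this approach the model still learns NN for lamb from the training data; the override is applied only to the final output, so it would not influence the tags the model picks for neighboring words.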

OpenNLP custom POS tagger: how to make the dictionary override input tags

0 answers: