OpenNLP sentence training example

Time: 2015-12-24 19:21:55

Tags: java opennlp training-data sentence

I am trying to train a new model using the example from the official OpenNLP manual. Here is the example:


    Charset charset = Charset.forName("UTF-8");
    ObjectStream lineStream = new PlainTextByLineStream(new FileInputStream("en-sent.train"), charset);
    ObjectStream sampleStream = new SentenceSampleStream(lineStream);
    SentenceModel model;
    try {
      model = SentenceDetectorME.train("en", sampleStream, true, null, TrainingParameters.defaultParams());
    } finally {
      sampleStream.close();
    }
    OutputStream modelOut = null;
    try {
      modelOut = new BufferedOutputStream(new FileOutputStream(modelFile));
      model.serialize(modelOut);
    } finally {
      if (modelOut != null) 
      modelOut.close();
    }

The problem is in the second line:

    
    ObjectStream lineStream = new PlainTextByLineStream(new FileInputStream("en-sent.train"), charset);

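The documentation says: "Deprecated. Use PlainTextByLineStream(InputStreamFactory, Charset) instead." But I don't know how to use that constructor. I would like an example that reads the same corpus file without the deprecated constructor.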

I wrote the following code based on the OpenNLP documentation. It calls the train method in two ways, the deprecated one and the one suggested in the docs:

    Charset charset = Charset.forName("UTF-8");
    InputStreamFactory inputStreamFactory=null;
    ObjectStream<String> lineStream=null;
    ObjectStream<SentenceSample> sampleStream=null;
    SentenceModel model=null;
    OutputStream modelOut = null;
    try{
        inputStreamFactory=InputStreamFactory.class.newInstance();
        lineStream=new PlainTextByLineStream(inputStreamFactory,charset);
        sampleStream = new SentenceSampleStream(lineStream);
        //The deprecated:
        model = SentenceDetectorME.train("en", sampleStream, true, null, TrainingParameters.defaultParams());
        //The suggested:
        model = SentenceDetectorME.train("en", sampleStream, new SentenceDetectorFactory(), new TrainingParameters()); 
    } catch (InstantiationException e2){
        e2.printStackTrace();
    } catch (IllegalAccessException e2){
        e2.printStackTrace();
    } catch (IOException e){
        e.printStackTrace();
    }finally {
        try{
            sampleStream.close();
        } catch (IOException e){
            e.printStackTrace();
        }
    }
    try {
        modelOut = new BufferedOutputStream(new FileOutputStream(new File("modelFile")));
        model.serialize(modelOut);
    } catch (FileNotFoundException e){
        e.printStackTrace();
    } catch (IOException e){
        e.printStackTrace();
    } finally {
        if (modelOut != null) try{
            modelOut.close();
        } catch (IOException e){
            e.printStackTrace();
        }      
    }

But in this new code I don't know where to specify my corpus data file. Any ideas?

1 Answer:

Answer 0: (score: 1)

You have to initialize inputStreamFactory with the data file you want to use, like this:

    inputStreamFactory = new MarkableFileInputStreamFactory(
            new File("en-sent.train"));
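
Putting it together, here is a minimal sketch of the whole training flow with the non-deprecated constructor. It assumes an OpenNLP version that ships `InputStreamFactory` and `MarkableFileInputStreamFactory` (1.6.x or later, as far as I know). The class name `SentenceTrainingSketch`, the helper methods, and the output file name `en-sent.bin` are made up for illustration. I also pass a four-argument `SentenceDetectorFactory` so that the `"en"`, `true` and `null` arguments of the deprecated call carry over; that is my choice, not something the question requires.

```java
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

import opennlp.tools.sentdetect.SentenceDetectorFactory;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.sentdetect.SentenceSample;
import opennlp.tools.sentdetect.SentenceSampleStream;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

// Hypothetical class name, for illustration only.
public class SentenceTrainingSketch {

    /** Trains a sentence model from a corpus file with one sentence per line. */
    static SentenceModel train(File corpus) throws IOException {
        // This is where the corpus file goes: the factory wraps it and opens
        // the underlying stream on behalf of PlainTextByLineStream.
        InputStreamFactory in = new MarkableFileInputStreamFactory(corpus);
        ObjectStream<SentenceSample> samples = new SentenceSampleStream(
                new PlainTextByLineStream(in, StandardCharsets.UTF_8));
        try {
            // Assumption: same settings as the deprecated call
            // (language "en", useTokenEnd = true, no abbreviation dictionary,
            // default end-of-sentence characters).
            SentenceDetectorFactory factory =
                    new SentenceDetectorFactory("en", true, null, null);
            return SentenceDetectorME.train(
                    "en", samples, factory, TrainingParameters.defaultParams());
        } finally {
            // Closing the sample stream also closes the wrapped line stream.
            samples.close();
        }
    }

    /** Writes the trained model to the given file. */
    static void save(SentenceModel model, File modelFile) throws IOException {
        try (OutputStream out =
                new BufferedOutputStream(new FileOutputStream(modelFile))) {
            model.serialize(out);
        }
    }

    public static void main(String[] args) throws IOException {
        SentenceModel model = train(new File("en-sent.train"));
        save(model, new File("en-sent.bin")); // hypothetical output name
    }
}
```

Note that `InputStreamFactory` is an interface, so `InputStreamFactory.class.newInstance()` from the question cannot produce a usable object. You need a concrete implementation such as `MarkableFileInputStreamFactory`, and that implementation is what receives the file.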