Unable to train location.bin using OpenNLP and Java

Time: 2014-05-18 16:42:43

Tags: java eclipse opennlp

I am trying to train an en-ner-location.bin file using OpenNLP in Java. My training text file is in the following format:

    <START:location> Fontana <END> <START:location> Palo Verde <END> <START:location> Picacho <END>

I use the following code to train the model:

import java.io.BufferedOutputStream;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.util.Collections;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.Span;

public class TrainNames {

    @SuppressWarnings("deprecation")
    public void TrainNames() throws IOException {
        File fileTrainer = new File("citytrain.txt");
        File output = new File("en-ner-location.bin");
        ObjectStream<String> lineStream = new PlainTextByLineStream(new FileInputStream(fileTrainer), "UTF-8");
        ObjectStream<NameSample> sampleStream = new NameSampleDataStream(lineStream);
        System.out.println("lineStream = " + lineStream);
        TokenNameFinderModel model = NameFinderME.train("en", "location", sampleStream, Collections.<String, Object>emptyMap(), 1, 0);

        BufferedOutputStream modelOut = null;
        try {
            modelOut = new BufferedOutputStream(new FileOutputStream(output));
            model.serialize(modelOut);
        } finally {
            if (modelOut != null)
                modelOut.close();
        }
    }
}

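I get no errors or warnings, but when I try to extract the city name from a string such as cnt = "John is planning to specialize in Electrical Engineering in UC Fontana and pursue a career with IBM."; it returns the whole string. Can anyone tell me why?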

1 Answer:

Answer 0: (score: 0)

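Welcome to SO! It looks like you need more context around each location annotation. I believe that right now OpenNLP thinks you are training it to find words (any words), because your training data contains nothing but the annotated words themselves. You need to annotate the locations inside whole sentences, and you will need at least a few hundred samples before you see good results.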
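To make the "context" point concrete, here is a minimal sketch that uses only the Java standard library. It assumes the name finder's documented training format: one whitespace-tokenized sentence per line, with each entity wrapped in `<START:location> ... <END>`. The class `TrainingLineBuilder` and its methods `annotate` and `hasContext` are hypothetical helpers made up for illustration; they are not part of OpenNLP.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of OpenNLP: builds and checks training lines
// in the annotated, whitespace-tokenized format shown in the question.
class TrainingLineBuilder {

    // Wraps tokens[start, end) in <START:type> ... <END> and keeps the rest
    // of the sentence around the entity as context.
    static String annotate(List<String> tokens, int start, int end, String type) {
        if (start < 0 || end > tokens.size() || start >= end) {
            throw new IllegalArgumentException("invalid span: " + start + ".." + end);
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (i == start) {
                out.add("<START:" + type + ">");
            }
            out.add(tokens.get(i));
            if (i == end - 1) {
                out.add("<END>");
            }
        }
        return String.join(" ", out);
    }

    // Returns true if at least one token on the line lies outside every
    // annotation, i.e. the line gives the model some surrounding context.
    static boolean hasContext(String line) {
        boolean inside = false;
        for (String tok : line.trim().split("\\s+")) {
            if (tok.isEmpty()) {
                continue; // blank line: nothing to inspect
            }
            if (tok.startsWith("<START:")) {
                inside = true;
            } else if (tok.equals("<END>")) {
                inside = false;
            } else if (!inside) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("John", "studies", "at", "UC", "Fontana", ".");
        String good = annotate(tokens, 4, 5, "location");
        System.out.println(good);

        // The line from the question consists of annotated names only.
        String bare = "<START:location> Fontana <END> <START:location> Palo Verde <END>";
        System.out.println(hasContext(good)); // true
        System.out.println(hasContext(bare)); // false
    }
}
```

A check like `hasContext` could be run over citytrain.txt before training to flag lines that contain only annotated names.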

See also this answer: How I train an Named Entity Recognizer identifier in OpenNLP?