"Found unexpected annotation while handling a name sequence"

Time: 2012-11-21 01:19:22

Tags: java opennlp

I want to train the named entity recognizer (name finder) in OpenNLP. I wrote some code based on the manual: http://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html#tools.namefind

I started with a simple example: training a "number" entity type by tagging every \d+ match in the training file, like this:

In <START:number> 1941 <END>, Paramount Pictures produced a movie version of the play.

The code is:

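import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.nio.charset.Charset;
import java.util.Collections;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

public class NameFinderTraining   // class name chosen only to make the snippet complete
{
    static String markedFile    = "C:/MyStuff/eclipse_workspace/OpenNlpTest/src/NameFinderTraining/en-ner-number-marked.train";
    static String modelFile     = "C:/MyStuff/eclipse_workspace/OpenNlpTest/src/NameFinderTraining/en-ner-number-marked.bin";

    @SuppressWarnings("deprecation")
    public static void main(String[] args) throws Exception 
    {
        // Read the annotated training file line by line as UTF-8.
        Charset charset = Charset.forName("UTF-8");
        ObjectStream<String> lineStream =
                new PlainTextByLineStream(new FileInputStream(markedFile), charset);
        // Each line is parsed into a NameSample; this is where the exception is raised.
        ObjectStream<NameSample> sampleStream = new NameSampleDataStream(lineStream);

        TokenNameFinderModel model;

        try 
        {
            // 100 iterations, cutoff 5, no additional resources.
            model = NameFinderME.train("en", "person", sampleStream,
                    Collections.<String, Object>emptyMap(), 100, 5);
        }
        finally 
        {
            sampleStream.close();
        }

        // Write the trained model to disk.
        BufferedOutputStream modelOut = null;
        try 
        {
            modelOut = new BufferedOutputStream(new FileOutputStream(modelFile));
            model.serialize(modelOut);
        } 
        finally 
        {
            if (modelOut != null) 
                modelOut.close();
        }
    }
}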

I get the following exception:

Computing event counts...  java.io.IOException: Found unexpected annotation while handling a name sequence: until the ###<START:number>### 1950 <END>s

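My guess is that "number" is not in the default list of annotations. What should I do? If I need a "custom annotation", could someone give me an example?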

1 Answer:

Answer 0: (score: 8)

OpenNLP throws this exception when it cannot correctly identify a tag in the training data.

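Try removing any special characters directly before or after the tags, so that every tag is separated from the surrounding text by whitespace. In the exception above, the offending text is "<END>s": the closing tag is glued to the following "s".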

<END>. is invalid.
<END> . is valid.
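
If the training file is large, fixing the spacing by hand is tedious. Below is a minimal sketch of a preprocessing step that puts exactly one space around every tag before the file goes to NameSampleDataStream. It uses only the Java standard library; the class name TagSpacingFixer and the method fix are hypothetical names made up for this example, and the regular expression assumes tags look like <START>, <START:type> or <END>.

```java
import java.util.regex.Pattern;

public class TagSpacingFixer {
    // Matches <START>, <START:type> or <END>, together with any whitespace
    // around it. The tag shapes are an assumption based on the question's data.
    private static final Pattern TAG =
            Pattern.compile("\\s*(<START(?::[^>\\s]+)?>|<END>)\\s*");

    // Returns the line with exactly one space on each side of every tag.
    static String fix(String line) {
        String spaced = TAG.matcher(line).replaceAll(" $1 ");
        // Two adjacent tags would leave a double space; collapse those runs.
        return spaced.replaceAll(" {2,}", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(fix("until the <START:number> 1950 <END>s"));
        System.out.println(fix("In <START:number> 1941 <END>, Paramount Pictures"));
    }
}
```

Note that this turns "1950 <END>s" into "1950 <END> s", so the trailing "s" becomes its own token. Whether that is what you want depends on how the text is tokenized when the model is used later; the training data should probably follow the same tokenization.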