Question

I am using CoreNLP to annotate named entities (NEs) in multi-line English text. When I do the following:

Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma, ner");
props.put("ssplit.newlineIsSentenceBreak", "always");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
String contentStr = "John speaks with Martin\n\nJeremy talks to him too.";
Annotation document = new Annotation(contentStr);
pipeline.annotate(document);
List<CoreMap> sents = document.get(SentencesAnnotation.class);
for (int i = 0; i < sents.size(); i++) {
    System.out.println("sentence " + i + " " + sents.get(i));
}

sentence splitting works correctly and two sentences are recognized. However, when I use the NER classifier as follows:

CRFClassifier classifier = CRFClassifier.getClassifier("edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz", props);
String classifiedStr = classifier.classifyWithInlineXML(contentStr);

I get an error about an unknown property, and the classifier seems to treat the whole text as a single sentence, so it wrongly recognizes the entity "Martin Jeremy" instead of two separate entities.

Any idea what is wrong?

Answer 1

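The properties accepted by CRFClassifier.getClassifier are not the same as those used by the StanfordCoreNLP constructor, which is why you get the unknown-option error.

The ssplit property will be set, but it is not used at runtime.

As the SeqClassifierFlags documentation shows, the properties you need to set are those of SeqClassifierFlags. Set tokenizerOptions to "tokenizeNLs=true" so that newlines are treated as tokens.

Bottom line: set the properties as follows before getting the classifier. This should no longer give you the unknown-property error, and it should work as expected.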

Properties props = new Properties();
props.put("tokenizerOptions", "tokenizeNLs=true");

CRFClassifier classifier = CRFClassifier.getClassifier("edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz", props);
String classifiedStr = classifier.classifyWithInlineXML(contentStr);
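
If the tokenizer option does not change the result in your CoreNLP version, a different workaround (not the approach above) is to classify each line on its own, so that no entity can span a line break. The following is only a minimal sketch under that assumption: PerLineNer and classifyPerLine are hypothetical names, and the tagger is passed in as a plain function so the example runs without CoreNLP on the classpath.

```java
import java.util.function.UnaryOperator;

class PerLineNer {

    /**
     * Applies tagLine to every non-blank line separately and keeps the
     * original line breaks, so a tagged span can never cross a newline.
     */
    static String classifyPerLine(String text, UnaryOperator<String> tagLine) {
        // limit -1 keeps trailing empty strings, so trailing newlines survive
        String[] lines = text.split("\n", -1);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < lines.length; i++) {
            if (i > 0) {
                out.append('\n');
            }
            String line = lines[i];
            // blank lines are copied through untouched
            out.append(line.trim().isEmpty() ? line : tagLine.apply(line));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Stand-in tagger for demonstration only (an assumption, not CoreNLP output).
        // With CoreNLP available, you would pass a function that calls
        // classifier.classifyWithInlineXML(line) instead.
        UnaryOperator<String> standInTagger = line -> "[" + line + "]";
        String contentStr = "John speaks with Martin\n\nJeremy talks to him too.";
        System.out.println(classifyPerLine(contentStr, standInTagger));
    }
}
```

The trade-off is that the classifier loses any context from neighbouring lines, which may or may not matter for your text.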

CRFClassifier does not recognize sentence splitter options