Question

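I have the following sentence (just one). When I tokenize it with the code below to get the index of each word, the tokenizer treats it as two sentences because of the period after "approx". How can I fix this?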

String sentence = "09-Aug-2003 -- On Saturday, 9th August 2003, Daniel and I start with our Enduros approx. 100 kilometers from the confluence point.";

Annotation document = new Annotation(sentence);
pipeline.annotate(document);
for (CoreLabel token : document.get(CoreAnnotations.TokensAnnotation.class)) {
     String word = token.get(CoreAnnotations.TextAnnotation.class);
     // println takes a single argument, so build one string from index and word
     System.out.println(token.index() + " " + word);
}

For example, the real index of "km" is 20, but according to this code it is 2.

Answer 1

If you add the following to the Properties object passed to the pipeline:

Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit");
props.setProperty("ssplit.isOneSentence", "true");

StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

then it will not split the text into separate sentences.

(Search for "ssplit" on this page to see all the other options: http://nlp.stanford.edu/software/corenlp.shtml)
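
As a side note, the "2 instead of 20" symptom comes from token indices being 1-based and restarting in each sentence. If sentence splitting has to stay on for some reason, a text-wide index could be recovered by adding the token counts of the earlier sentences. Below is a minimal plain-Java sketch of that arithmetic. It does not call CoreNLP: the class name, the `documentIndex` helper, and the hand-written token split are all assumptions made for illustration, and the real tokenizer may split the text differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration only; nothing here is part of the CoreNLP API.
final class TokenIndexSketch {

    // Hypothetical helper: turns a 1-based index that restarts in every
    // sentence into a 1-based index that runs across the whole text.
    // sentenceNo is 0-based, indexInSentence is 1-based.
    static int documentIndex(List<List<String>> sentences, int sentenceNo, int indexInSentence) {
        int offset = 0;
        for (int i = 0; i < sentenceNo; i++) {
            offset += sentences.get(i).size(); // tokens in earlier sentences
        }
        return offset + indexInSentence;
    }

    public static void main(String[] args) {
        // Assumed split into two "sentences", written out by hand.
        List<List<String>> sentences = new ArrayList<>();
        sentences.add(Arrays.asList("09-Aug-2003", "--", "On", "Saturday", ",",
                "9th", "August", "2003", ",", "Daniel", "and", "I", "start",
                "with", "our", "Enduros", "approx", "."));
        sentences.add(Arrays.asList("100", "kilometers", "from", "the",
                "confluence", "point", "."));

        // "kilometers" is token 2 of the second sentence (sentenceNo 1).
        System.out.println(documentIndex(sentences, 1, 2)); // prints 20
    }
}
```

With this assumed split the first sentence has 18 tokens, so token 2 of the second sentence maps to 20. Setting ssplit.isOneSentence as shown above is still the simpler fix when the input is known to be a single sentence.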
