Question

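CoreNLP's tokenization alters the sentence text. Joining the tokens back together with whitespace does not truly reconstruct the original, and things get more complicated when the sentence contains parentheses and other punctuation. See the code block below.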

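import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

// Run only tokenization and sentence splitting.
Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

// paragraph is the input text as a String.
Annotation document = new Annotation(paragraph);
pipeline.annotate(document);

List<CoreMap> sentences = document.get(SentencesAnnotation.class);

List<String> sentenceList = new ArrayList<>();
for (CoreMap sentence : sentences)
{
    // How to get the original text of sentence?
}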

Answer 1

Answering my own question. It is very simple: insert the following line in place of the comment in the question's code block.

String sentenceString = Sentence.listToOriginalTextString(sentence.get(TokensAnnotation.class));

Note: in newer CoreNLP releases this utility class appears under the name SentenceUtils rather than Sentence, so the call may need to be adjusted to match the version in use.

Answer 2

for (CoreMap sentence : sentences)
{
    // TextAnnotation on the sentence CoreMap holds the sentence text.
    String sentenceStr = sentence.get(CoreAnnotations.TextAnnotation.class);
}
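
To illustrate why joining tokens with spaces is lossy while character offsets are not, here is a minimal sketch in plain Java that does not call CoreNLP at all. The `Token` class, the sample sentence, and the hand-written offsets are hypothetical stand-ins for what a tokenizer produces (including the `-LRB-`/`-RRB-` rewriting of parentheses that a PTB-style tokenizer may apply). In CoreNLP, the comparable per-token offsets are stored under `CoreAnnotations.CharacterOffsetBeginAnnotation` and `CoreAnnotations.CharacterOffsetEndAnnotation`.

```java
import java.util.List;
import java.util.stream.Collectors;

final class OriginalTextSketch {

    // Hypothetical stand-in for a tokenizer's output: the (possibly rewritten)
    // token text plus its [begin, end) character offsets in the source string.
    static final class Token {
        final String word;
        final int begin;
        final int end;

        Token(String word, int begin, int end) {
            this.word = word;
            this.begin = begin;
            this.end = end;
        }
    }

    // Assumed sample input; only the first sentence is tokenized below.
    static final String SOURCE = "Call me (now), please. Thanks!";

    // Lossy: joining with single spaces inserts whitespace that was never in
    // the source and keeps any rewritten token text such as -LRB-.
    static String joinWithSpaces(List<Token> tokens) {
        return tokens.stream().map(t -> t.word).collect(Collectors.joining(" "));
    }

    // Lossless: slice the source from the first token's begin offset to the
    // last token's end offset.
    static String originalText(String source, List<Token> tokens) {
        if (tokens.isEmpty()) {
            return "";
        }
        int begin = tokens.get(0).begin;
        int end = tokens.get(tokens.size() - 1).end;
        return source.substring(begin, end);
    }

    // Hand-written tokens and offsets for the first sentence of SOURCE.
    static List<Token> sampleTokens() {
        return List.of(
            new Token("Call", 0, 4),
            new Token("me", 5, 7),
            new Token("-LRB-", 8, 9),
            new Token("now", 9, 12),
            new Token("-RRB-", 12, 13),
            new Token(",", 13, 14),
            new Token("please", 15, 21),
            new Token(".", 21, 22)
        );
    }

    public static void main(String[] args) {
        List<Token> tokens = sampleTokens();
        System.out.println(joinWithSpaces(tokens));       // Call me -LRB- now -RRB- , please .
        System.out.println(originalText(SOURCE, tokens)); // Call me (now), please.
    }
}
```

The offset-based slice returns the sentence exactly as written, which is the same kind of result the two answers above aim for using CoreNLP's own annotations.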

How do I get the original text of a sentence after CoreNLP runs ssplit?
