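Hello people of the internet,
We are having the following problem with the Stanford NLP API:
We have a String that we want to turn into a list of sentences.
First, we used String sentenceString = Sentence.listToString(sentence);
but because of tokenization, listToString does not return the original text. Now we are trying to use listToOriginalTextString in the following way: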
    private static List<String> getSentences(String text) {
        Reader reader = new StringReader(text);
        DocumentPreprocessor dp = new DocumentPreprocessor(reader);
        List<String> sentenceList = new ArrayList<String>();
        for (List<HasWord> sentence : dp) {
            String sentenceString = Sentence.listToOriginalTextString(sentence);
            sentenceList.add(sentenceString);
        }
        return sentenceList;
    }
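This does not work. Apparently we have to set an attribute "invertible" to true, but we don't know how. How can we do that?
In general, how do you use listToOriginalTextString properly? What preparation does it need?
Regards, Khayet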
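Answer 0 (score: 0)
If I understand correctly, you want to map the tokens back to the original input text after tokenization. You can do it like this: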
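    import java.io.StringReader;
    import java.util.List;
    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.process.PTBTokenizer;
    import edu.stanford.nlp.process.WordToSentenceProcessor;

    // Tokenize with PTBTokenizer (PTBLexer); CoreLabel tokens keep character offsets
    List<CoreLabel> tokens = PTBTokenizer.coreLabelFactory().getTokenizer(new StringReader(text)).tokenize();
    // Split the tokens into sentences with the Stanford sentence splitter (WordToSentenceProcessor)
    WordToSentenceProcessor<CoreLabel> processor = new WordToSentenceProcessor<>();
    List<List<CoreLabel>> splitSentences = processor.process(tokens);
    // For each sentence
    for (List<CoreLabel> s : splitSentences) {
        // For each token in the sentence
        for (CoreLabel token : s) {
            // Here you can read the token value and its position in the input:
            // token.value(), token.beginPosition(), token.endPosition()
        }
    }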
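To make the offset idea concrete: once every token carries a begin and end position (as token.beginPosition() and token.endPosition() do above), the original text of a sentence is just the slice of the input from the first token's begin to the last token's end. Below is a minimal, library-free sketch of that step. The Token class, the whitespace tokenizer and the punctuation-based splitter are hypothetical stand-ins made up for illustration; they are not the Stanford classes and they split far more naively than PTBTokenizer or WordToSentenceProcessor would.

```java
import java.util.ArrayList;
import java.util.List;

class OriginalTextSketch {

    // Hypothetical stand-in for a token that remembers its character offsets.
    static final class Token {
        final String value;
        final int begin; // inclusive offset into the input text
        final int end;   // exclusive offset into the input text

        Token(String value, int begin, int end) {
            this.value = value;
            this.begin = begin;
            this.end = end;
        }
    }

    // Naive whitespace tokenizer that records offsets (illustration only).
    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            if (Character.isWhitespace(text.charAt(i))) {
                i++;
                continue;
            }
            int start = i;
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            tokens.add(new Token(text.substring(start, i), start, i));
        }
        return tokens;
    }

    // Naive splitter: a sentence ends after a token ending in '.', '!' or '?'.
    static List<List<Token>> split(List<Token> tokens) {
        List<List<Token>> sentences = new ArrayList<>();
        List<Token> current = new ArrayList<>();
        for (Token t : tokens) {
            current.add(t);
            char last = t.value.charAt(t.value.length() - 1);
            if (last == '.' || last == '!' || last == '?') {
                sentences.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            sentences.add(current);
        }
        return sentences;
    }

    // Slice the untouched input between the first and last token of a sentence.
    static String originalText(String text, List<Token> sentence) {
        if (sentence.isEmpty()) {
            return "";
        }
        int start = sentence.get(0).begin;
        int end = sentence.get(sentence.size() - 1).end;
        return text.substring(start, end);
    }

    static List<String> getSentences(String text) {
        List<String> result = new ArrayList<>();
        for (List<Token> sentence : split(tokenize(text))) {
            result.add(originalText(text, sentence));
        }
        return result;
    }

    public static void main(String[] args) {
        // Irregular spacing inside each sentence survives, because we slice the input.
        for (String s : getSentences("Hello  world.   How are   you?")) {
            System.out.println("[" + s + "]");
        }
    }
}
```

Because the sentences are cut out of the input string rather than rebuilt from token values, whitespace and any characters the tokenizer would normalize stay exactly as they were typed. With the real CoreLabel tokens from the loop above, the same slice would presumably be text.substring(s.get(0).beginPosition(), s.get(s.size() - 1).endPosition()).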
Answer 1 (score: 0)
    String sentenceStr = sentence.get(CoreAnnotations.TextAnnotation.class);
This gives you the original text of the sentence. An example from the JSONOutputter.java file:
    l2.set("id", sentence.get(CoreAnnotations.SentenceIDAnnotation.class));
    l2.set("index", sentence.get(CoreAnnotations.SentenceIndexAnnotation.class));
    l2.set("sentenceOriginal", sentence.get(CoreAnnotations.TextAnnotation.class));
    l2.set("line", sentence.get(CoreAnnotations.LineNumberAnnotation.class));