Question

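Hello people of the internet,

We're having the following problem with the Stanford NLP API: we have a String that we want to turn into a list of sentences. At first we used String sentenceString = Sentence.listToString(sentence); but listToString does not return the original text because of the tokenization. Now we are trying to use listToOriginalTextString in the following way: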

private static List<String> getSentences(String text) {
    Reader reader = new StringReader(text);
    DocumentPreprocessor dp = new DocumentPreprocessor(reader);
    List<String> sentenceList = new ArrayList<String>();

    for (List<HasWord> sentence : dp) {
        String sentenceString = Sentence.listToOriginalTextString(sentence);
        sentenceList.add(sentenceString.toString());
    }

    return sentenceList;
}

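This does not work. Apparently we have to set an attribute "invertible" to true, but we don't know how. How can we do this?

In general, how do you use listToOriginalTextString properly? What preparation does it need?

Best regards, Khayet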

Answer 1

If I understand correctly, you want to map the tokens back to the original input text after tokenization. You can do it like this:

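        // Imports needed at the top of the file:
        // import java.io.StringReader;
        // import java.util.List;
        // import edu.stanford.nlp.ling.CoreLabel;
        // import edu.stanford.nlp.process.PTBTokenizer;
        // import edu.stanford.nlp.process.WordToSentenceProcessor;

        // Tokenize the text with PTBTokenizer (backed by PTBLexer)
        List<CoreLabel> tokens = PTBTokenizer.coreLabelFactory().getTokenizer(new StringReader(text)).tokenize();

        // Split the tokens into sentences with the Stanford sentence splitter (WordToSentenceProcessor)
        WordToSentenceProcessor<CoreLabel> processor = new WordToSentenceProcessor<>();
        List<List<CoreLabel>> splitSentences = processor.process(tokens);

        // For each sentence
        for (List<CoreLabel> s : splitSentences) {

            // For each token in the sentence
            for (CoreLabel token : s) {
                // Here you can read the token value and its character offsets in the input text:
                // token.value(), token.beginPosition(), token.endPosition()
            }

        }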
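
The offsets mentioned in the comments above can be used to cut each sentence straight out of the input string, which keeps the original spacing and punctuation. The following is only a minimal sketch of that idea and does not call the Stanford library: Token is a hypothetical stand-in for CoreLabel, with begin and end playing the role of beginPosition() and endPosition(), and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetSentences {

    // Hypothetical stand-in for CoreLabel: it only stores the character
    // offsets of a token in the original text (begin inclusive, end exclusive).
    public static final class Token {
        public final int begin;
        public final int end;

        public Token(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    // Returns the untouched slice of the input that one sentence covers,
    // from the start of its first token to the end of its last token.
    public static String originalText(String text, List<Token> sentence) {
        if (sentence.isEmpty()) {
            return "";
        }
        int begin = sentence.get(0).begin;
        int end = sentence.get(sentence.size() - 1).end;
        return text.substring(begin, end);
    }

    // Applies originalText to every sentence produced by a sentence splitter.
    public static List<String> originalSentences(String text, List<List<Token>> sentences) {
        List<String> result = new ArrayList<>();
        for (List<Token> sentence : sentences) {
            result.add(originalText(text, sentence));
        }
        return result;
    }

    public static void main(String[] args) {
        // Note the double space after "Hi": it survives because we slice the input.
        String text = "Hi  there. Bye now.";

        // Offsets written by hand here; in practice they would come from the tokenizer.
        List<Token> first = new ArrayList<>();
        first.add(new Token(0, 2));   // Hi
        first.add(new Token(4, 9));   // there
        first.add(new Token(9, 10));  // .

        List<Token> second = new ArrayList<>();
        second.add(new Token(11, 14)); // Bye
        second.add(new Token(15, 18)); // now
        second.add(new Token(18, 19)); // .

        List<List<Token>> sentences = new ArrayList<>();
        sentences.add(first);
        sentences.add(second);

        for (String s : originalSentences(text, sentences)) {
            System.out.println(s);
        }
    }
}
```

Running main prints "Hi  there." and "Bye now." on separate lines. With real CoreLabel tokens the same slicing would be text.substring(first.beginPosition(), last.endPosition()), assuming the offsets refer to the same string that was passed to the tokenizer.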

Answer 2

String sentenceStr = sentence.get(CoreAnnotations.TextAnnotation.class);

This gives you the original text of the sentence. An example from the JSONOutputter.java file:

l2.set("id", sentence.get(CoreAnnotations.SentenceIDAnnotation.class));
l2.set("index", sentence.get(CoreAnnotations.SentenceIndexAnnotation.class));
l2.set("sentenceOriginal",sentence.get(CoreAnnotations.TextAnnotation.class));
l2.set("line", sentence.get(CoreAnnotations.LineNumberAnnotation.class));

Getting the original text after using the Stanford NLP parser
