Getting the original text after using the Stanford NLP parser

Time: 2016-07-28 12:44:27

Tags: java stanford-nlp

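Hello people of the internet,

We ran into the following problem with the Stanford NLP API: we have a String that we want to split into a list of sentences. At first we used String sentenceString = Sentence.listToString(sentence); but listToString does not return the original text, because of tokenization. Now we are trying to use listToOriginalTextString in the following way: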

private static List<String> getSentences(String text) {
        Reader reader = new StringReader(text);
        DocumentPreprocessor dp = new DocumentPreprocessor(reader);
        List<String> sentenceList = new ArrayList<String>();

        for (List<HasWord> sentence : dp) {
            String sentenceString = Sentence.listToOriginalTextString(sentence);
            sentenceList.add(sentenceString.toString());
        }

        return sentenceList;
    }

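This does not work. Apparently we have to set a property called "invertible" to true, but we don't know how. How can we do that?

More generally, how do you use listToOriginalTextString correctly? What preparation does it need?

Regards, Khayet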

2 answers:

Answer 0: (score: 0)

If I understand correctly, you want to map the tokens back to the original input text after tokenization. You can do it like this:

        //split via PTBTokenizer (PTBLexer)
        List<CoreLabel> tokens = PTBTokenizer.coreLabelFactory().getTokenizer(new StringReader(text)).tokenize();

        //do the processing using stanford sentence splitter (WordToSentenceProcessor)
        WordToSentenceProcessor<CoreLabel> processor = new WordToSentenceProcessor<>();
        List<List<CoreLabel>> splitSentences = processor.process(tokens);

        //for each sentence
        for (List<CoreLabel> s : splitSentences) {                

            //for each word
            for (CoreLabel token : s) {
                //here you can get the token value and position like;
                //token.value(), token.beginPosition(), token.endPosition()
            }    

        }
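
To make the offset idea concrete, here is a minimal sketch in plain Java with no Stanford NLP dependency. It assumes every token carries begin/end character offsets into the input, as token.beginPosition() and token.endPosition() do above. The Token class, the method names, and the hand-written offsets are hypothetical stand-ins for illustration, not part of the library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OriginalTextFromOffsets {

    // Hypothetical stand-in for a tokenizer token: the normalized value plus
    // [begin, end) character offsets into the original input string.
    static final class Token {
        final String value;
        final int begin;
        final int end;

        Token(String value, int begin, int end) {
            this.value = value;
            this.begin = begin;
            this.end = end;
        }
    }

    // Slice the input from the first token's start to the last token's end,
    // so spacing and characters inside the sentence are kept untouched.
    static String originalText(String text, List<Token> sentence) {
        if (sentence.isEmpty()) {
            return "";
        }
        int start = sentence.get(0).begin;
        int end = sentence.get(sentence.size() - 1).end;
        return text.substring(start, end);
    }

    // Apply originalText to every sentence of an already split document.
    static List<String> originalSentences(String text, List<List<Token>> sentences) {
        List<String> result = new ArrayList<>();
        for (List<Token> sentence : sentences) {
            result.add(originalText(text, sentence));
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "I can't go.  Bye (now).";

        // Offsets written by hand here; a real tokenizer would supply them.
        List<Token> first = Arrays.asList(
                new Token("I", 0, 1), new Token("ca", 2, 4), new Token("n't", 4, 7),
                new Token("go", 8, 10), new Token(".", 10, 11));
        List<Token> second = Arrays.asList(
                new Token("Bye", 13, 16), new Token("-LRB-", 17, 18), new Token("now", 18, 21),
                new Token("-RRB-", 21, 22), new Token(".", 22, 23));

        for (String s : originalSentences(text, Arrays.asList(first, second))) {
            System.out.println(s);
        }
        // prints:
        // I can't go.
        // Bye (now).
    }
}
```

Joining the token values with spaces would instead give "I ca n't go ." and "Bye -LRB- now -RRB- .", which is why slicing by offsets is the safer way to recover the sentence as it was typed.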

Answer 1: (score: 0)

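String sentenceStr = sentence.get(CoreAnnotations.TextAnnotation.class);

This gives you the original text. Note that sentence here is a CoreMap sentence annotation, not the List<HasWord> from the question. An example from the JSONOutputter.java file: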

l2.set("id", sentence.get(CoreAnnotations.SentenceIDAnnotation.class));
l2.set("index", sentence.get(CoreAnnotations.SentenceIndexAnnotation.class));
l2.set("sentenceOriginal",sentence.get(CoreAnnotations.TextAnnotation.class));
l2.set("line", sentence.get(CoreAnnotations.LineNumberAnnotation.class));