Question

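I am using the Stanford CoreNLP tool to tokenize text, and the character offset of each token is very important to me (I need the offsets of every token so that I can use them later in Brat). The relevant part of my program is as follows: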

pipeline.annotate(annotation);

List<CoreMap> sentences = annotation.get(CoreAnnotations.SentencesAnnotation.class);
if (sentences != null && !sentences.isEmpty()) {
    for (CoreMap sentence : sentences) {
        for (CoreMap token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
            // Token string plus its named-entity tag, tab-separated
            words = token + "\t" + token.get(NamedEntityTagAnnotation.class);
            // Compact dump of the token's annotations, from which the offsets are read
            String word_offset = token.toShorterString();
            wordsId.add(words);
            wordsId1.add(words.substring(0, words.indexOf("-")).trim());
            wordsId2.add(word_offset);
            System.out.println(word_offset); // logged for Text_woffset.txt
        }
    }
}

Input = "D: That's great!

CM: How are you, Daniella? {BR}

{NS}

D: I'm doing well, except for the fact that I'm hearing a bit of an echo.

CM: Oh. {LG} Darn.

D: Give me a moment.

CM: Okay."

I use the following code to read the input:

Text = new Scanner(new File(Input)).useDelimiter("\\A").next();

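With this input I get wrong offsets. For example, the token "Daniella" should have the offsets [28 36], but the tool gives me [27,35], and in the middle of the text the token offsets are off by 10 to 30 characters. Can you tell me how to use the tokenizer on this kind of conversational text? I also passed the actual text directly as input (to make sure the problem is not caused by the Scanner), but the problem persists.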

Answer 1

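What you want are the CharacterOffsetBeginAnnotation and CharacterOffsetEndAnnotation attached to each token. The shorthand for these is CoreLabel.beginPosition() and CoreLabel.endPosition(). One small tweak to your code: the tokens can be typed as CoreLabels (a subtype of CoreMap). The CoreLabel class has a bunch of utility methods that make tokens easier to work with.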
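Once you have a begin and end offset for a token, a quick way to see whether the offsets line up with your text is to cut that span out of the exact string you gave the pipeline and compare it with the token. The sketch below is plain Java with no CoreNLP dependency. It assumes the offsets are zero-based indices into that string with an exclusive end, and the class and method names (OffsetCheck, matches, toBratLine) are made up for illustration. The offsets in main are computed with indexOf only so the example runs on its own; in your program they would come from the token.

```java
public class OffsetCheck {

    /** True if [begin, end) is a valid span of text that spells exactly the token. */
    public static boolean matches(String text, String token, int begin, int end) {
        return begin >= 0 && begin <= end && end <= text.length()
                && text.substring(begin, end).equals(token);
    }

    /** Formats one text-bound annotation line in Brat standoff style: ID, TAB, label and offsets, TAB, text. */
    public static String toBratLine(int id, String label, int begin, int end, String token) {
        return "T" + id + "\t" + label + " " + begin + " " + end + "\t" + token;
    }

    public static void main(String[] args) {
        // Hypothetical one-line input; the offsets below stand in for tokenizer output.
        String text = "CM: How are you, Daniella?";
        int begin = text.indexOf("Daniella");
        int end = begin + "Daniella".length(); // exclusive end

        // Correct span: the substring equals the token.
        System.out.println(matches(text, "Daniella", begin, end));
        // Span shifted left by one character: the check fails.
        System.out.println(matches(text, "Daniella", begin - 1, end - 1));
        System.out.println(toBratLine(1, "PERSON", begin, end, "Daniella"));
    }
}
```

If the check fails only after the first line break, it is worth comparing the string you tokenized with the file Brat reads, character by character, since any difference between the two (for example in how line endings are stored) would shift every later offset.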

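As a general rule, in the class hierarchy both CoreLabel and Annotation are subtypes of CoreMap. Semantically, an Annotation is almost always a document, a CoreMap is almost always a sentence, and a CoreLabel is almost always a token.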
