Question

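I am using Stanford CoreNLP and tried the following example. It tokenizes the words in a text, but it also extracts punctuation such as commas and periods. Is there a property I can set so that punctuation is not extracted, or some other way to get the same result? Here is the code sample. I know this is easy in Python, but I don't know how to do it in Java. Please advise.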

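    import java.util.List;
    import java.util.Properties;
    import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
    import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation;
    import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import edu.stanford.nlp.util.CoreMap;

    Properties props = new Properties();
    props.setProperty("annotators", "tokenize, ssplit");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    String text = "this is simple text written in English,Spanish etc.";

    // create an empty Annotation just with the given text
    Annotation document = new Annotation(text);

    // run the configured annotators (tokenizer and sentence splitter)
    pipeline.annotate(document);

    List<CoreMap> sentences = document.get(SentencesAnnotation.class);

    for (CoreMap sentence : sentences) {
        for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
            // this is the text of the token
            String word = token.get(TextAnnotation.class);
        }
    }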

Answer 1

We don't have any tokenizer option to skip these, but it isn't difficult to do yourself. Punctuation strings are a closed class.

You can match punctuation with a regular expression (use \p{Punct}; see, for example, "Punctuation Regex in Java"). Then just drop the tokens whose text matches this regex.
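As a rough sketch of that filtering step, the snippet below uses only the Java standard library so it can run without CoreNLP on the classpath. The class name `PunctuationFilter`, the helper names `isPunctuation` and `dropPunctuation`, and the hard-coded token list are all made up for illustration; in the pipeline from the question you would presumably apply the same check to `token.get(TextAnnotation.class)` inside the inner loop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of Stanford CoreNLP.
class PunctuationFilter {
    // Matches strings made up entirely of ASCII punctuation, e.g. "," or "...".
    private static final Pattern PUNCT_ONLY = Pattern.compile("\\p{Punct}+");

    // True if the whole token text is punctuation.
    static boolean isPunctuation(String tokenText) {
        return PUNCT_ONLY.matcher(tokenText).matches();
    }

    // Returns a new list containing only the non-punctuation token texts.
    static List<String> dropPunctuation(List<String> tokenTexts) {
        List<String> words = new ArrayList<>();
        for (String tokenText : tokenTexts) {
            if (!isPunctuation(tokenText)) {
                words.add(tokenText);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Assumed token texts; a real tokenizer may split the sentence differently.
        List<String> tokens = List.of("written", "in", "English", ",", "Spanish", "etc", ".");
        System.out.println(dropPunctuation(tokens)); // prints [written, in, English, Spanish, etc]
    }
}
```

Because the pattern must match the whole token, a token like "etc." is kept and only pure punctuation tokens are dropped. By default `\p{Punct}` covers ASCII punctuation only, so Unicode quotes or dashes, or any token the tokenizer has rewritten into a form containing letters (such as `-LRB-`), would need an extended pattern.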
