Question

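I'm trying out Stanford NLP's TokensRegex to find dimensions (e.g. 100x120) in a text. My plan is to first retokenize the input so that these tokens are split further (using the example provided in retokenize.rules.txt), and then search for the new pattern.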

After retokenizing, however, only null values are left in place of the original string:

The top level annotation
[Text=100x120 Tokens=[null-1, null-2, null-3] Sentences=[100x120]]

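The retokenization itself seems to work (there are 3 tokens in the result), but the values are lost. How can I preserve the original values in the token list?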

My retokenize.rules.txt file (as in the demo):

tokens = { type: "CLASS", value:"edu.stanford.nlp.ling.CoreAnnotations$TokensAnnotation" }
options.matchedExpressionsAnnotationKey = tokens;
options.extractWithTokens = TRUE;
options.flatten = TRUE;
ENV.defaults["ruleType"] = "tokens"
ENV.defaultStringPatternFlags = 2
ENV.defaultResultAnnotationKey = tokens

{ pattern: ( /\d+(x|X)\d+/ ), result: Split($0[0], /x|X/, TRUE) }
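
For reference, the split this rule is meant to produce can be mimicked in plain Java. This is only a sketch of the intended result, not how TokensRegex implements `Split`; the class and method names are made up for illustration, and it assumes (judging by the three tokens in the output above) that the `TRUE` argument keeps the separator as its own token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of CoreNLP: a plain-Java analogue of the rule above.
public class DimensionSplitSketch {
    private static final Pattern DIMENSION = Pattern.compile("\\d+[xX]\\d+");
    private static final Pattern SEPARATOR = Pattern.compile("[xX]");

    /** Splits a dimension token into its pieces; any other input comes back as a single piece. */
    public static List<String> splitKeepingSeparator(String token) {
        List<String> pieces = new ArrayList<>();
        if (!DIMENSION.matcher(token).matches()) {
            pieces.add(token);
            return pieces;
        }
        Matcher m = SEPARATOR.matcher(token);
        int start = 0;
        while (m.find()) {
            pieces.add(token.substring(start, m.start())); // text before the separator
            pieces.add(m.group());                         // the separator itself
            start = m.end();
        }
        pieces.add(token.substring(start));                // text after the last separator
        return pieces;
    }

    public static void main(String[] args) {
        System.out.println(splitKeepingSeparator("100x120")); // prints [100, x, 120]
    }
}
```

So for the input 100x120 the token list should contain the texts 100, x and 120, which is what I expected to see instead of the null values.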

The main method:

public static void main(String[] args) throws IOException {
    //...
    text = "100x120";
    Properties properties = new Properties();
    properties.setProperty("tokenize.language", "de");
    properties.setProperty("annotators", "tokenize,retokenize,ssplit,pos,lemma,ner");
    properties.setProperty("customAnnotatorClass.retokenize", "edu.stanford.nlp.pipeline.TokensRegexAnnotator");
    properties.setProperty("retokenize.rules", "retokenize.rules.txt");
    StanfordCoreNLP stanfordPipeline = new StanfordCoreNLP(properties);
    runPipeline(stanfordPipeline, text);

}

The pipeline:

public static void runPipeline(StanfordCoreNLP pipeline, String text) {
    Annotation annotation = new Annotation(text);
    pipeline.annotate(annotation);
    out.println();
    out.println("The top level annotation");
    out.println(annotation.toShorterString());
    //...
}

Answer 1

Thanks for letting us know. CoreAnnotations.ValueAnnotation is not being populated; we will update TokensRegex to populate that field.

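In any case, you should be able to use TokensRegex to retokenize as planned. Most of the pipeline does not depend on ValueAnnotation and uses CoreAnnotations.TextAnnotation instead. You can get the text of the new tokens from CoreAnnotations.TextAnnotation (each token is a CoreLabel, so you can also access it with token.word()).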

See TokensRegexRetokenizeDemo for example code on how to get at the different annotations.
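
As a rough sketch of what that could look like (this is not the demo code; the class and method names are invented for illustration, and it needs the CoreNLP jar on the classpath), the token text can be read with `token.word()`. The second method is an optional workaround that copies the text into the value field wherever the value is still null, on the assumption that the `null` in `null-1` is exactly that unset ValueAnnotation.

```java
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of CoreNLP.
public class TokenTextSketch {

    /** Collects the surface text of every token, read from TextAnnotation via word(). */
    public static List<String> tokenTexts(Annotation annotation) {
        List<String> texts = new ArrayList<>();
        for (CoreLabel token : annotation.get(CoreAnnotations.TokensAnnotation.class)) {
            texts.add(token.word()); // same as token.get(CoreAnnotations.TextAnnotation.class)
        }
        return texts;
    }

    /** Workaround sketch: fill ValueAnnotation from the token text where it is missing. */
    public static void fillMissingValues(Annotation annotation) {
        for (CoreLabel token : annotation.get(CoreAnnotations.TokensAnnotation.class)) {
            if (token.value() == null) {
                token.setValue(token.word());
            }
        }
    }
}
```

In the question's runPipeline, either method would be called after pipeline.annotate(annotation), for example `out.println(TokenTextSketch.tokenTexts(annotation));`.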
