Question

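I want to add addresses (and possibly other rule-based entities) to the NER pipeline, and TokensRegex looks like a very useful DSL for doing so. Following https://stackoverflow.com/a/42604225, I created the following rules file: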

ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }

{ pattern: ([{ner:"NUMBER"}] [{pos:"NN"}|{pos:"NNP"}] /ave(nue)?|st(reet)?|boulevard|blvd|r(oa)?d/), action: Annotate($0, ner, "address") }

Here is a Scala REPL session showing how I tried to set up the annotation pipeline.

@ import edu.stanford.nlp.pipeline.{StanfordCoreNLP, CoreDocument}

@ import edu.stanford.nlp.util.PropertiesUtils.asProperties

@ val pipe = new StanfordCoreNLP(asProperties(
  "customAnnotatorClass.tokensregex", "edu.stanford.nlp.pipeline.TokensRegexAnnotator",
  "annotators", "tokenize,ssplit,pos,lemma,ner,tokensregex",
  "ner.combinationMode", "HIGH_RECALL",
  "tokensregex.rules", "addresses.tregx"))
pipe: StanfordCoreNLP = edu.stanford.nlp.pipeline.StanfordCoreNLP@2ce6a051

@ val doc = new CoreDocument("Adam Smith lived at 123 noun street in Glasgow, Scotland")
doc: CoreDocument = Adam Smith lived at 123 noun street in Glasgow, Scotland

@ pipe.annotate(doc)

@ doc.sentences.get(0).nerTags
res5: java.util.List[String] = [PERSON, PERSON, O, O, address, address, address, O, CITY, O, COUNTRY]

@ doc.entityMentions
res6: java.util.List[edu.stanford.nlp.pipeline.CoreEntityMention] = [Adam Smith, 123, Glasgow, Scotland]

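As you can see, the address is tagged correctly in the sentence's nerTags, but it does not show up in the document's entityMentions. Is there a way to make that happen?

Also, is there a way, from the document, to distinguish two adjacent TokensRegex matches from a single match? (Assume I have a more complicated set of regexes; in the current example I match exactly 3 tokens, so I could just count tokens.)

I also tried approaching this with regexner and a tokens regex, as described at https://stanfordnlp.github.io/CoreNLP/regexner.html, but I could not get that to work.

Since I'm working in Scala, I'm happy to dive into the Java API to get this working, if necessary, rather than fiddle with properties and resource files.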

Answer 1

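Yes, I recently added some changes (in the GitHub version) to make this easier! Make sure to download the latest version from GitHub. Stanford CoreNLP 3.9.2 will include these changes, and we are aiming to release it fairly soon.

If you read this page, you can get an understanding of the full NER pipeline run by the NERCombinerAnnotator: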

https://stanfordnlp.github.io/CoreNLP/ner.html

There is also a lot of write-up on TokensRegex here:

https://stanfordnlp.github.io/CoreNLP/tokensregex.html

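Basically, what you want to do is run the ner annotator and use its TokensRegex sub-annotator. Imagine you have some named entity rules in a file called my_ner.rules.

You could run a command like this: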

java -Xmx5g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -ner.additional.tokensregex.rules my_ner.rules -outputFormat text -file example.txt

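This runs a TokensRegex sub-annotator during the full named entity recognition process. Then, when the final entity mentions step runs, it operates on the named entities extracted by the rules and creates entity mentions from them.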
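For anyone driving the pipeline from Scala rather than the command line, the same configuration might look roughly like the sketch below. It assumes that the rules file from the question (addresses.tregx) can be passed unchanged through the ner.additional.tokensregex.rules property, and that the CoreNLP build on the classpath already contains the changes mentioned above. Neither assumption is verified here, so treat this as a starting point rather than a confirmed recipe.

```scala
import edu.stanford.nlp.pipeline.{StanfordCoreNLP, CoreDocument}
import edu.stanford.nlp.util.PropertiesUtils.asProperties

// Assumption: addresses.tregx is the rules file from the question and can be
// resolved from the working directory or the classpath.
val pipe = new StanfordCoreNLP(asProperties(
  "annotators", "tokenize,ssplit,pos,lemma,ner",
  // Run the rules inside the ner annotator (as a sub-annotator) instead of
  // registering a separate tokensregex annotator after it.
  "ner.additional.tokensregex.rules", "addresses.tregx"))

val doc = new CoreDocument("Adam Smith lived at 123 noun street in Glasgow, Scotland")
pipe.annotate(doc)

// entityMentions is a java.util.List[CoreEntityMention]; passing a Scala
// lambda to forEach relies on SAM conversion (Scala 2.12 or later).
doc.entityMentions.forEach(m => println(s"${m.entityType}\t${m.text}"))
```

Because the rules run before the entity mentions step, no separate customAnnotatorClass.tokensregex entry should be needed in this setup.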
