Pattern does not behave as expected

Time: 2019-09-08 16:57:48

Tags: stanford-nlp tokenize

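The real patterns are not written in English, so I created this simplified example to reproduce the problem. There are three levels of annotation (the real application needs all three), and the third-level pattern does not work as expected. The phrase to recognize is: a b c

What I expect:

  • 1st level: "a" is annotated as A and "b" is annotated as B
  • 2nd: if annotations A and B are both present, annotate all of those tokens as AB
  • 3rd: if there is at least one AB annotation followed by the word "c", annotate all of those tokens as C

The patterns are shown below.
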
# 1.
{  pattern: (/a/), action: (Annotate($0, name, "A")) }
{  pattern: (/b/), action: (Annotate($0, name, "B")) }
# 2.
{  pattern: (([name:A]) ([name:B])), action: (Annotate($0, name, "AB")) }
# 3.
{  pattern: (([name:AB]+) /c/), action: (Annotate($0, name, "C")) }

#1 and #2 work, and "a b" is annotated:

Matched token: NamedEntitiesToken{word='a' name='AB' beginPosition=0 endPosition=1}
Matched token: NamedEntitiesToken{word='b' name='AB' beginPosition=2 endPosition=3}

However, pattern #3 does not match, even though we can see two tokens annotated with "AB", which is exactly what pattern #3 expects. If I change #1 to

{  pattern: (/a/), action: (Annotate($0, name, "AB")) }
{  pattern: (/b/), action: (Annotate($0, name, "AB")) }

pattern #3 works correctly:

Matched token: NamedEntitiesToken{word='a' name='C' beginPosition=0 endPosition=1}
Matched token: NamedEntitiesToken{word='b' name='C' beginPosition=2 endPosition=3}
Matched token: NamedEntitiesToken{word='c' name='C' beginPosition=4 endPosition=5}

I can't find any difference between the matched tokens when I use

# In this case #3 pattern works
{  pattern: (/a/), action: (Annotate($0, name, "AB")) }
{  pattern: (/b/), action: (Annotate($0, name, "AB")) }

or when I use

# In this case #3 pattern doesn't work
# 1.
{  pattern: (/a/), action: (Annotate($0, name, "A")) }
{  pattern: (/b/), action: (Annotate($0, name, "B")) }
# 2.
{  pattern: (([name:A]) ([name:B])), action: (Annotate($0, name, "AB")) }

In both cases I get the same annotations, but the first case works and the second does not. What am I doing wrong?

1 Answer:

Answer 0: (score: 0)

This works for me:

# these Java classes will be used by the rules
ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }

ENV.defaults["stage"] = 1

{ ruleType: "tokens", pattern: (/a/), action: Annotate($0, ner, "A") }
{ ruleType: "tokens", pattern: (/b/), action: Annotate($0, ner, "B") }

ENV.defaults["stage"] = 2

{ ruleType: "tokens", pattern: ([{ner: "A"}] [{ner: "B"}]), action: Annotate($0, ner, "AB") }

ENV.defaults["stage"] = 3

{ ruleType: "tokens", pattern: ([{ner: "AB"}]+ /c/), action: Annotate($0, ner, "ABC") }
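
To make the intended effect of the three stages concrete, here is a plain-Java simulation of the passes on the tokens a b c. It is only a sketch and does not call CoreNLP or TokensRegex at all: the class StagedTaggingSketch and its methods are hypothetical names, whole-word equality stands in for the /a/, /b/ and /c/ token patterns, and it assumes that each stage finishes before the next one starts, which is how the stage numbers in the rules above are meant to be read.

```java
import java.util.Arrays;

public class StagedTaggingSketch {

    // Stage 1: tag single tokens, mirroring the /a/ -> A and /b/ -> B rules.
    static void stage1(String[] words, String[] tags) {
        for (int i = 0; i < words.length; i++) {
            if (words[i].equals("a")) {
                tags[i] = "A";
            } else if (words[i].equals("b")) {
                tags[i] = "B";
            }
        }
    }

    // Stage 2: retag every adjacent (A, B) pair as AB.
    static void stage2(String[] tags) {
        for (int i = 0; i + 1 < tags.length; i++) {
            if ("A".equals(tags[i]) && "B".equals(tags[i + 1])) {
                tags[i] = "AB";
                tags[i + 1] = "AB";
            }
        }
    }

    // Stage 3: a run of one or more AB tokens followed by the word "c"
    // is retagged as ABC, including the "c" token itself.
    static void stage3(String[] words, String[] tags) {
        int i = 0;
        while (i < tags.length) {
            if (!"AB".equals(tags[i])) {
                i++;
                continue;
            }
            int j = i;
            while (j < tags.length && "AB".equals(tags[j])) {
                j++;
            }
            if (j < tags.length && words[j].equals("c")) {
                Arrays.fill(tags, i, j + 1, "ABC");
                i = j + 1;
            } else {
                i = j;
            }
        }
    }

    // Runs the three stages in order; untagged tokens stay null.
    static String[] tag(String[] words) {
        String[] tags = new String[words.length];
        stage1(words, tags);
        stage2(tags);
        stage3(words, tags);
        return tags;
    }

    public static void main(String[] args) {
        // prints [ABC, ABC, ABC]
        System.out.println(Arrays.toString(tag(new String[] {"a", "b", "c"})));
    }
}
```

Because stage 3 only ever sees the tags left behind by stage 2, the input a b without a trailing c stays tagged AB, AB in this sketch.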

There is a write-up on TokensRegex here:

https://stanfordnlp.github.io/CoreNLP/tokensregex.html