spaCy rule-based Matcher for units of measurement before or after a number

Asked: 2019-11-26 16:10:29

Tags: python nlp spacy

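I'm new to spaCy and I'm trying to match some measurements in text. My problem is that the unit of measurement sometimes comes before the value and sometimes after it. In some other cases the unit is written differently. Here is some code: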

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')

# case 1:
text = "the surface is 31 sq"
# case 2:
# text = "the surface is sq 31"
# case 3:
# text = "the surface is square meters 31"
# case 4:
# text = "the surface is 31 square meters"
# case 5:
# text = "the surface is about 31 square meters"
# case 6:
# text = "the surface is 31 kilograms"

pattern = [
    {"IS_STOP": True}, 
    {"LOWER": "surface"}, 
    {"LEMMA": "be", "OP": "?"}, 
    {"LOWER": "sq", "OP": "?"},
    {"LOWER": "square", "OP": "?"},
    {"LOWER": "meters", "OP": "?"},
    {"IS_DIGIT": True}, 
    {"LOWER": "square", "OP": "?"},
    {"LOWER": "meters", "OP": "?"},
    {"LOWER": "sq", "OP": "?"} 
]

doc = nlp(text)

matcher = Matcher(nlp.vocab) 

# spaCy v2 call; in spaCy v3 this is matcher.add("Surface", [pattern])
matcher.add("Surface", None, pattern)

matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(match_id, string_id, start, end, span.text)

I have two problems. 1. The pattern should match all of cases 1 to 5, but for case 1 the output is:

4898162435462687487 Surface 0 4 the surface is 31
4898162435462687487 Surface 0 5 the surface is 31 sq 

This looks like a duplicate match to me.

2. Case 6 should not match, but it does match my pattern. Any suggestions on how to improve this?

Edit: is it possible to build an OR condition inside a pattern? Something like:

pattern = [
    {"POS": "DET", "OP": "?"}, 
    {"LOWER": "surface"}, 
    {"LEMMA": "be", "OP": "?"},  
    [
      [{"LOWER": "sq", "OP": "?"},
      {"LOWER": "square", "OP": "?"},
      {"LOWER": "meters", "OP": "?"},
      {"IS_ALPHA": True, "OP": "?"},
      {"LIKE_NUM": True}]
     OR
      [{"LIKE_NUM": True},
      {"LOWER": "square", "OP": "?"},
      {"LOWER": "meters", "OP": "?"},
      {"LOWER": "sq", "OP": "?"} ]
    ]
]

1 Answer:

Answer 0 (score: 1)

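You can't use an OR like that, but you can define separate patterns for the same label. So you need two patterns: one that matches a number preceded by sq, square, meters, or a combination of these words, and another that matches a number followed by at least one of these words.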

Code snippet:

import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")

texts = [
    "the surface is 31 sq",
    "the surface is sq 31",
    "the surface is square meters 31",
    "the surface is 31 square meters",
    "the surface is about 31 square meters",
    "the surface is 31 kilograms",
]
pattern1 = [
    {"IS_STOP": True},
    {"LOWER": "surface"},
    {"LEMMA": "be", "OP": "?"},
    {"TEXT": {"REGEX": "^(?i:sq(?:uare)?|m(?:et(?:er|re)s?)?)$"}, "OP": "+"},
    {"LIKE_NUM": True},
]
pattern2 = [
    {"IS_STOP": True},
    {"LOWER": "surface"},
    {"LEMMA": "be", "OP": "?"},
    {"IS_ALPHA": True, "OP": "?"},
    {"LIKE_NUM": True},
    {"TEXT": {"REGEX": "^(?i:sq(?:uare)?|m(?:et(?:er|re)s?)?)$"}, "OP": "+"},
]

matcher = Matcher(nlp.vocab, validate=True)
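# Both patterns share the "Surface" label, which acts as the OR.
# spaCy v2 calls; in spaCy v3 this is matcher.add("Surface", [pattern1, pattern2])
matcher.add("Surface", None, pattern1)
matcher.add("Surface", None, pattern2)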

for text in texts:
    doc = nlp(text)
    matches = matcher(doc)
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id]  # Get string representation
        span = doc[start:end]  # The matched span
        print(match_id, string_id, start, end, span.text)

Output:

4898162435462687487 Surface 0 5 the surface is 31 sq
4898162435462687487 Surface 0 5 the surface is sq 31
4898162435462687487 Surface 0 6 the surface is square meters 31
4898162435462687487 Surface 0 5 the surface is 31 square
4898162435462687487 Surface 0 6 the surface is about 31 square
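
On the "duplicate match" from question 1: the matcher reports every span that satisfies a pattern, so optional ("?") or repeated ("+") tokens can yield several overlapping matches. Below is a minimal sketch of one way to keep only the longest non-overlapping ones, assuming the matches are plain (match_id, start, end) tuples like those returned by the matcher. The helper name keep_longest is hypothetical; spaCy's own spacy.util.filter_spans does a similar job on Span objects.

```python
def keep_longest(matches):
    """Keep the longest matches, dropping any that overlap one already kept."""
    # Longest first; ties are broken by the earliest start index.
    ordered = sorted(matches, key=lambda m: (-(m[2] - m[1]), m[1]))
    kept, used = [], set()
    for match_id, start, end in ordered:
        tokens = range(start, end)
        if not used.intersection(tokens):
            kept.append((match_id, start, end))
            used.update(tokens)
    # Return the surviving matches in document order.
    return sorted(kept, key=lambda m: m[1])


# The two overlapping matches reported for case 1 in the question.
matches = [(4898162435462687487, 0, 4), (4898162435462687487, 0, 5)]
print(keep_longest(matches))  # [(4898162435462687487, 0, 5)]
```

In the loop above, this would be applied as keep_longest(matcher(doc)) before printing.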

The {"TEXT" : {"REGEX": "^(?i:sq(?:uare)?|m(?:et(?:er|re)s?)?)$"}, "OP": "+"} part matches one or more tokens whose text matches the regex (one or more because of "OP": "+"):

  • ^ - start of the token
  • (?i: - start of a case-insensitive modifier group:
    • sq(?:uare)? - sq or square
    • | - or
    • m(?:et(?:er|re)s?)? - m, meter / metre, meters / metres
  • ) - end of the group
  • $ - end of the string (here, the token).
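
The breakdown above can be checked without spaCy by testing the expression against individual words. This is only a sketch: splitting on whitespace stands in for spaCy's tokenizer, which is an assumption that holds only for simple texts like this one, and the sample sentence and the name UNIT_RE are made up for illustration.

```python
import re

# Same expression as in the patterns; (?i:...) makes the group case-insensitive.
UNIT_RE = re.compile(r"^(?i:sq(?:uare)?|m(?:et(?:er|re)s?)?)$")

# Whitespace splitting is a stand-in for spaCy tokenization (assumption).
tokens = "the surface is 31 Square metres or 31 kilograms".split()
units = [tok for tok in tokens if UNIT_RE.search(tok)]
print(units)  # ['Square', 'metres']
```

Because of the ^ and $ anchors, a token such as kilograms or squared is rejected as a whole, which is why case 6 no longer matches.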