Matching with optional tokens: allowing up to n tokens in between

Time: 2019-11-26 11:49:14

Tags: python nlp spacy

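I am using spaCy to match specific expressions in some text (in Italian). My text can appear in several forms, and I am trying to learn the best way to write a general rule. I have the following 4 cases, and I would like to write one general pattern that works for all of them. Something like this: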

import spacy
from spacy.matcher import Matcher

# case 1
text = 'Superfici principali e secondarie: 90 mq'
# case 2
# text = 'Superfici principali e secondarie di 90 mq'
# case 3
# text = 'Superfici principali e secondarie circa 90 mq'
# case 4
# text = 'Superfici principali e secondarie di circa 90 mq'

nlp = spacy.load('it_core_news_sm')
doc = nlp(text)

matcher = Matcher(nlp.vocab) 

pattern = [
    {"LOWER": "superfici"}, {"LOWER": "principali"}, {"LOWER": "e"}, {"LOWER": "secondarie"},
    # << some token here that allows at most 3 tokens, or an IS_PUNCT token, or nothing at all >>
    {"IS_DIGIT": True}, {"LOWER": "mq"},
]

# spaCy v2 call signature; in spaCy v3 this becomes matcher.add("Superficie", [pattern])
matcher.add("Superficie", None, pattern)

matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(match_id, string_id, start, end, span.text)

1 answer:

Answer 0 (score: 1)

You can add an optional {"IS_PUNCT": True, "OP": "?"} token, followed by three optional IS_ALPHA tokens:

pattern = [
            {"LOWER": "superfici"}, 
            {"LOWER": "principali"},
            {"LOWER": "e"},
            {"LOWER": "secondarie"},
            {"IS_PUNCT": True, "OP": "?"},
            {"IS_ALPHA": True, "OP": "?"},
            {"IS_ALPHA": True, "OP": "?"},
            {"IS_ALPHA": True, "OP": "?"},
            {"IS_DIGIT": True},
            {"LOWER": "mq"}
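          ]

"OP": "?" makes a token optional: it may match zero or one time, i.e. the token either appears once or is absent. The pattern therefore accepts an optional punctuation mark followed by at most three alphabetic tokens before the number.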