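Rule-based pattern matching in spaCy returns a match ID and the start and end token indices of the matched span, but I can't find anything in the documentation that says how to determine which tokens in that span were matched by which part of the pattern.
In a regular expression, I can put parentheses around groups to capture them and pull them out of the pattern. Is that possible in spaCy?
For example, I have this text (from Dracula):
They wore high boots, with their trousers tucked into them, and had long black hair and heavy black moustaches.
I have defined an experiment: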
import spacy
from spacy.matcher import Matcher


def test_match(text, patterns):
    nlp = spacy.load('en_core_web_sm')
    matcher = Matcher(nlp.vocab)
    # spaCy v2 signature; in v3 this becomes matcher.add('Boots', [patterns])
    matcher.add('Boots', None, patterns)
    doc = nlp(text)
    matches = matcher(doc)
    for match in matches:
        # start and end are token indices into doc, not character offsets
        match_id, start, end = match
        string_id = nlp.vocab.strings[match_id]
        span = doc[start:end]
        print(match, span.text)


text_a = "They wore high boots, with their trousers tucked into them, " \
         "and had long black hair and heavy black moustaches."

patterns = [
    {'POS': 'PRON'},
    {'TAG': 'VBD'},
    {'POS': 'ADJ'},
    {'TAG': 'NNS'}
]

test_match(text_a, patterns)
This outputs:
(18231591219755621867, 0, 4) They wore high boots
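For a simple pattern like this, with four tokens in a row, I can assume that token 0 is the pronoun, token 1 is the past-tense verb, and so on. But for patterns with quantifiers it becomes ambiguous. Is it possible to have spaCy tell me which tokens actually matched which components of the pattern?
For example, take this modification to the experiment above, with two `*` operators in the pattern and a new version of the text that lacks the adjective "high":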
text_b = "They wore boots, with their trousers tucked into them, " \
         "and had long black hair and heavy black moustaches."

patterns = [
    {'POS': 'PRON'},
    {'TAG': 'VBD'},
    {'POS': 'ADJ', 'OP': '*'},
    {'TAG': 'NNS', 'OP': '*'}
]

test_match(text_a, patterns)
print()
test_match(text_b, patterns)
Which outputs:
(18231591219755621867, 0, 2) They wore
(18231591219755621867, 0, 3) They wore high
(18231591219755621867, 0, 4) They wore high boots

(18231591219755621867, 0, 2) They wore
(18231591219755621867, 0, 3) They wore boots
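In both sets of output, it is unclear which of the final tokens is the adjective and which is the plural noun. I suppose I could loop over the tokens in the span and manually match them against the parts of the pattern, but that would surely be duplicated work. Since spaCy must have identified them in order to find the match, can't it just tell me which is which?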
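Answer 0 (score: 5)
As of spaCy v3.0.6, match alignment information can be returned as part of the match tuple (see the Matcher API documentation) by passing with_alignments=True: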
matches = matcher(doc, with_alignments=True)
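For your example, this produces the following output, where the fourth element of each tuple lists, for each token in the match, the index of the pattern entry that matched it: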
(1618900948208871284, 0, 2, [0, 1]) They wore
(1618900948208871284, 0, 3, [0, 1, 2]) They wore high
(1618900948208871284, 0, 4, [0, 1, 2, 3]) They wore high boots
(1618900948208871284, 0, 2, [0, 1]) They wore
(1618900948208871284, 0, 3, [0, 1, 3]) They wore boots
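To make the alignment list easier to read, each matched token can be paired with the pattern entry it points to. The following is only a minimal sketch: `label_match` is a hypothetical helper (not part of spaCy), and the token list and match tuple are hard-coded stand-ins copied from the last line of output above so that the snippet runs without loading a model.

```python
def label_match(tokens, match, pattern):
    """Pair each matched token with the pattern entry that matched it.

    `match` is a (match_id, start, end, alignments) tuple in the shape
    shown above for matcher(doc, with_alignments=True).
    """
    _match_id, start, end, alignments = match
    return [(tok, pattern[idx]) for tok, idx in zip(tokens[start:end], alignments)]


pattern = [
    {'POS': 'PRON'},
    {'TAG': 'VBD'},
    {'POS': 'ADJ', 'OP': '*'},
    {'TAG': 'NNS', 'OP': '*'},
]

# Stand-ins for the first tokens of text_b and the last match tuple above.
tokens = ["They", "wore", "boots", ","]
match = (1618900948208871284, 0, 3, [0, 1, 3])

labelled = label_match(tokens, match, pattern)
for tok, entry in labelled:
    print(tok, entry)
```

Here "boots" is paired with pattern entry 3 (the NNS entry), which shows that the optional ADJ entry matched nothing. With a real pipeline, the same helper should work if `doc` is passed in place of `tokens`, in which case each `tok` is a spaCy Token and `tok.text` gives its string.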