I am trying to find FRT entities with EntityRuler, like this:
from spacy.lang.en import English
from spacy.pipeline import EntityRuler
nlp = English()
ruler = EntityRuler(nlp)
patterns = [{"label": "FRT", "pattern": [{'REGEX': "[Aa]ppl[e|es])"}]},
{"label": "BRN", "pattern": [{"LOWER": "granny"}, {"LOWER": "smith"}]}]
ruler.add_patterns(patterns)
nlp.add_pipe(ruler)
doc = nlp(u"Apple is red. Granny Smith apples are green.")
print([(ent.text, ent.label_) for ent in doc.ents])
Then I get this result:
[('Apple', 'FRT'), ('is', 'FRT'), ('red', 'FRT'), ('.', 'FRT'), ('Granny Smith', 'BRN'), ('apples', 'FRT'), ('are', 'FRT'), ('green', 'FRT'), ('.', 'FRT')]
Could you please tell me how to fix the code so that I get this result instead?
[('Apple', 'FRT'), ('Granny Smith', 'BRN'), ('apples', 'FRT')]
Thank you.
Answer 0 (score: 4)
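You are missing the top-level token attribute that the regex should be matched against. Because that top-level token attribute is missing, the REGEX key is ignored and the pattern is interpreted as "any token", which is why every token was labelled FRT. Nest REGEX under an attribute such as TEXT: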
from spacy.lang.en import English
from spacy.pipeline import EntityRuler
nlp = English()
ruler = EntityRuler(nlp)
patterns = [{"label": "FRT", "pattern": [{'TEXT' : {'REGEX': "[Aa]ppl[e|es]"}}]},
{"label": "BRN", "pattern": [{"LOWER": "granny"}, {"LOWER": "smith"}]}]
ruler.add_patterns(patterns)
nlp.add_pipe(ruler)
doc = nlp(u"Apple is red. Granny Smith apples are green.")
print([(ent.text, ent.label_) for ent in doc.ents])
Output:
[('Apple', 'FRT'), ('Granny Smith', 'BRN'), ('apples', 'FRT')]
In fact, you can also use the pattern below for apple:
{"label": "FRT", "pattern": [{'LOWER' : {'REGEX': "appl[e|es]"}}]}
Answer 1 (score: 2)
You need to use the following patterns declaration to fix the whole code:
patterns = [{"label": "FRT", "pattern": [{"TEXT" : {"REGEX": "[Aa]pples?"}}]},
{"label": "BRN", "pattern": [{"LOWER": "granny"}, {"LOWER": "smith"}]}]
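Two things: 1) the REGEX operator does not work on its own unless you define it under a top-level token attribute such as TEXT or LOWER; 2) the regex you are using is broken, because you are using a character class rather than a grouping construct.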
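Note that [e|es] is a regex character class that matches a single character: e, s or |. So, if you have the string Appl| is red., the result will contain ('Appl|', 'FRT'). You need to use either a non-capturing group, (?:es|e), or just es?, which matches an e followed by an optional s.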
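The difference between the character class and the two suggested fixes can be checked with Python's standard re module alone. This is a minimal sketch, not spaCy itself: it assumes the REGEX operator applies the pattern to each token string with search-style (unanchored) matching, and the names char_class, group, optional_s and the sample tokens are made up for illustration.

```python
import re

# [e|es] is a character class: it consumes exactly ONE character,
# which may be 'e', '|' or 's'.
char_class = re.compile(r"[Aa]ppl[e|es]")
# Either of these expresses "appl + e, optionally followed by s".
group = re.compile(r"[Aa]ppl(?:es|e)")
optional_s = re.compile(r"[Aa]pples?")

# Hypothetical token strings, as a tokenizer might hand them to the matcher.
tokens = ["Apple", "apples", "Appl|", "Appls"]

# For each token: (character class hit, group hit, optional-s hit)
results = {
    token: (
        bool(char_class.search(token)),
        bool(group.search(token)),
        bool(optional_s.search(token)),
    )
    for token in tokens
}

for token, flags in results.items():
    print(token, flags)
```

Under these assumptions the character class accepts the junk strings Appl| and Appls, while the other two patterns reject them.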
Also, see these cases:
[{"TEXT" : {"REGEX": "[Aa]pples?"}}]
will find Apple, apple, Apples, apples, but won't find APPLES.
[{"LOWER" : {"REGEX": "apples?"}}]
will find Apple, apple, Apples, apples, APPLES, aPPleS, etc., and also stapples (a misspelling of staples).
[{"TEXT" : {"REGEX": r"\b[Aa]pples?\b"}}]
will find Apple, apple, Apples, apples, but won't find APPLES or stapples, since \b is a word boundary.
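The three cases can be approximated outside spaCy with a small helper. This is only a sketch under stated assumptions: regex_hits is a hypothetical function (not part of spaCy), it assumes the pattern is applied to each token with re.search, and it mimics LOWER by lower-casing the token before matching.

```python
import re


def regex_hits(pattern, tokens, lower=False):
    """Return the tokens the pattern hits.

    lower=False mimics matching on TEXT (the token as written);
    lower=True mimics matching on LOWER (the lower-cased token).
    """
    compiled = re.compile(pattern)
    return [t for t in tokens if compiled.search(t.lower() if lower else t)]


# Hypothetical sample tokens for illustration.
tokens = ["Apple", "apples", "APPLES", "aPPleS", "stapples"]

text_hits = regex_hits(r"[Aa]pples?", tokens)           # TEXT, unanchored
lower_hits = regex_hits(r"apples?", tokens, lower=True)  # LOWER, unanchored
bounded_hits = regex_hits(r"\b[Aa]pples?\b", tokens)     # TEXT, word boundaries

print(text_hits)
print(lower_hits)
print(bounded_hits)
```

Note that, under the search-style assumption, the unanchored TEXT pattern also hits stapples, because apple occurs inside it; only the \b version excludes it.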