Question

I have a list of animals stored in a CSV file, like this:
["cat", "dog", "fish", "bird" ...], to name just a few.
Here is an example sentence: "I have a cat."

How can I visualize the matches here?
A detailed code example would be much appreciated!

Answer 1

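Using spaCy's built-in displaCy visualizer, you can pass in one or more Doc objects, and it will highlight all entities available via the doc.ents property. doc.ents is writable, so you can use the PhraseMatcher to find the animals in your text, create a new Span object for each match, and add it to the existing entities. Here's an example: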

import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

animals = ['cat', 'dog', 'fish', 'bird']

nlp = spacy.load('en_core_web_sm')  # or any other model
patterns = [nlp(animal) for animal in animals]  # process each word to create phrase pattern
matcher = PhraseMatcher(nlp.vocab)
matcher.add('ANIMAL', patterns)  # add patterns to matcher (spaCy v3 signature; older v2 releases used matcher.add('ANIMAL', None, *patterns))

doc = nlp("I have a cat")
matches = matcher(doc)

for match_id, start, end in matches:
    # create a new Span for each match and use the match_id (ANIMAL) as the label
    span = Span(doc, start, end, label=match_id)
    doc.ents = list(doc.ents) + [span]  # add span to doc.ents

print([(ent.text, ent.label_) for ent in doc.ents])  # [('cat', 'ANIMAL')]

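Your Doc object now contains an entity span for "cat", so it will be highlighted when you run displaCy. For more details, including how to add custom colors for entities, see the visualizers documentation.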

from spacy import displacy
displacy.serve(doc, style='ent')
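
As a rough sketch of the custom colors mentioned above: the "ent" style accepts an options dict with "ents" and "colors" keys. The hex color below is an arbitrary choice, and the blank English pipeline with a hand-set entity is only an assumption to keep the example independent of a downloaded model.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

# Assumption: a blank pipeline (tokenizer only) so no model download is needed.
nlp = spacy.blank("en")
doc = nlp("I have a cat")

# Mark the token "cat" (token index 3) as an ANIMAL entity by hand.
doc.ents = [Span(doc, 3, 4, label="ANIMAL")]

# Only render ANIMAL entities and give them an arbitrary custom color.
options = {"ents": ["ANIMAL"], "colors": {"ANIMAL": "#ffd966"}}

# jupyter=False makes render() return the HTML markup as a string.
html = displacy.render(doc, style="ent", options=options, jupyter=False)
print(html)
```

The returned markup can be written to an .html file, or the same options dict can be passed to displacy.serve.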

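One important note: each token can only be part of one entity, so this approach won't work if you have overlapping matches, or if your matches conflict with entities that already exist on the Doc. You can prevent this by explicitly filtering out overlapping spans while iterating over matches. Each match gives you the start and end token indices, so before adding a span to doc.ents, you can check whether an existing entity already covers any token in that range.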
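
The overlap check described above could look something like the following. This is a plain-Python sketch on token offsets only; the helper name overlaps_existing and the sample offsets are made up for illustration and are not part of spaCy.

```python
def overlaps_existing(start, end, existing):
    """Return True if the token range [start, end) shares a token
    with any (start, end) pair in `existing`."""
    return any(start < e_end and e_start < end for e_start, e_end in existing)


# Hypothetical data: token offsets of entities already on the Doc, and
# (match_id, start, end) tuples shaped like the matcher's output.
existing = [(0, 2), (5, 6)]
matches = [("ANIMAL", 1, 3), ("ANIMAL", 3, 4), ("ANIMAL", 5, 6)]

kept = []
for match_id, start, end in matches:
    if not overlaps_existing(start, end, existing):
        kept.append((match_id, start, end))
        # Record the accepted range so later matches cannot overlap it either.
        existing.append((start, end))

print(kept)  # [('ANIMAL', 3, 4)]
```

With real spaCy objects, the existing list could be built from (ent.start, ent.end) for each ent in doc.ents. spaCy also ships spacy.util.filter_spans, which takes a list of Span objects and keeps the longest non-overlapping ones.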

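For a more elegant solution, you could also wrap the matcher logic in a custom pipeline component. It will then run automatically each time you process a text with the nlp object.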
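
A minimal sketch of such a component, assuming spaCy v3 (where components are registered with @Language.factory and added by name). The factory name "animal_matcher" and the hard-coded ANIMALS list are illustrative choices; a blank English pipeline is used here only so the example does not depend on a downloaded model.

```python
import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
from spacy.util import filter_spans

ANIMALS = ["cat", "dog", "fish", "bird"]  # could equally be loaded from a CSV file


@Language.factory("animal_matcher")  # hypothetical component name
def create_animal_matcher(nlp, name):
    # Build the matcher once, when the component is added to the pipeline.
    matcher = PhraseMatcher(nlp.vocab)
    matcher.add("ANIMAL", [nlp.make_doc(animal) for animal in ANIMALS])

    def animal_component(doc):
        # One Span per match, labelled with the match_id (ANIMAL).
        spans = [Span(doc, start, end, label=match_id)
                 for match_id, start, end in matcher(doc)]
        # filter_spans drops overlapping spans, preferring the longest ones,
        # so assigning to doc.ents cannot fail on conflicts.
        doc.ents = filter_spans(list(doc.ents) + spans)
        return doc

    return animal_component


nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline
nlp.add_pipe("animal_matcher")

doc = nlp("I have a cat")
print([(ent.text, ent.label_) for ent in doc.ents])  # [('cat', 'ANIMAL')]
```

With a trained pipeline such as en_core_web_sm, the component would typically be added after the "ner" component so that the matches are merged with the model's predicted entities.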
