Question

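I'm using spaCy for an NLP project. I have a list of phrases that I'd like to mark as a new entity type. I originally tried training an NER model, but since the list of terms is finite, I think simply using a Matcher should be easier. I see in the documentation that you can add entities to a document based on a Matcher. My question is: how do I do this for a new entity without having the NER pipe label any other tokens as this entity? Ideally, only tokens found via my matcher should be marked as the entity, but I need to add it as a label to the NER model, which then ends up labeling some other tokens as the entity as well.

Any suggestions on how best to accomplish this? Thanks!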

Answer 1

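I think you might want to implement something similar to this example, i.e. a custom pipeline component that uses the PhraseMatcher and assigns entities. spaCy's built-in entity recognizer is also just a pipeline component, so you can remove it from the pipeline and add your custom component instead: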

nlp = spacy.load('en')               # load some model
nlp.remove_pipe('ner')               # remove the entity recognizer
entity_matcher = EntityMatcher(nlp)  # use your own entity matcher component
nlp.add_pipe(entity_matcher)         # add it to the pipeline
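
To make the "just a pipeline component" idea concrete, here is a deliberately simplified toy model of a pipeline in plain Python. It is not spaCy's implementation: `ToyPipeline`, `toy_ner` and `toy_matcher` are hypothetical names, and the "Doc" is only a dict. It shows only that a component is a callable that takes a doc and returns it, and that removing one component and adding another changes who gets to set the entities.

```python
class ToyPipeline:
    """Toy stand-in for the `nlp` object: a tokenizer plus ordered components."""

    def __init__(self):
        self.pipeline = []  # list of (name, component) pairs, applied in order

    def add_pipe(self, name, component):
        self.pipeline.append((name, component))

    def remove_pipe(self, name):
        self.pipeline = [(n, c) for n, c in self.pipeline if n != name]

    @property
    def pipe_names(self):
        return [n for n, _ in self.pipeline]

    def __call__(self, text):
        # Whitespace "tokenizer" producing a toy doc (assumption: a plain dict)
        doc = {"tokens": text.split(), "ents": []}
        for _, component in self.pipeline:
            doc = component(doc)  # each component receives and returns the doc
        return doc


def toy_ner(doc):
    # Pretend statistical model: tags every capitalised token
    doc["ents"] = [t for t in doc["tokens"] if t[0].isupper()]
    return doc


def toy_matcher(doc):
    # Pretend term matcher: tags only tokens from a fixed term list
    terms = {"spaCy"}
    doc["ents"] = [t for t in doc["tokens"] if t in terms]
    return doc


nlp_toy = ToyPipeline()
nlp_toy.add_pipe("ner", toy_ner)
nlp_toy.remove_pipe("ner")                       # drop the "recognizer"
nlp_toy.add_pipe("entity_matcher", toy_matcher)  # add the custom component
result = nlp_toy("Berlin loves spaCy")
```

With the toy recognizer removed, only the term found by the matcher ends up in `result["ents"]`.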

Your entity matcher component could look something like this:

from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

class EntityMatcher(object):
    name = 'entity_matcher'

    def __init__(self, nlp, terms, label):
        patterns = [nlp(term) for term in terms]
        self.matcher = PhraseMatcher(nlp.vocab)
        self.matcher.add(label, None, *patterns)

    def __call__(self, doc):
        matches = self.matcher(doc)
        spans = []
        for label, start, end in matches:
            span = Span(doc, start, end, label=label)
            spans.append(span)
        doc.ents = spans
        return doc

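When the component is initialized, it creates match patterns for your terms and adds them to the phrase matcher. My example assumes that you have a list of terms and a label you want to assign to those terms: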

entity_matcher = EntityMatcher(nlp, your_list_of_terms, 'SOME_LABEL')
nlp.add_pipe(entity_matcher)

print(nlp.pipe_names)  # see all components in the pipeline

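When you call nlp on a string of text, spaCy tokenizes the text to create a Doc object and then calls the individual pipeline components on the Doc in order. Your custom component's __call__ method then finds matches in the document, creates a Span for each of them (which lets you assign a custom label), and finally adds them to the doc.ents property and returns the Doc.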
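
One caveat when writing matches to doc.ents: spaCy does not accept overlapping entity spans there, so a term list containing both "New York" and "New York City" can make the assignment fail. The following is a minimal sketch of one way to resolve this, assuming you prefer the longest match. `filter_overlaps` is a hypothetical helper written in plain Python; more recent spaCy versions also ship `spacy.util.filter_spans`, which serves a similar purpose on Span objects.

```python
def filter_overlaps(matches):
    """Keep the longest matches and drop any match overlapping a kept one.

    `matches` is a list of (label, start, end) tuples with `end` exclusive,
    i.e. the same shape the matcher returns.
    """
    # Longest first; on equal length, the match that starts earlier wins.
    ordered = sorted(matches, key=lambda m: (m[2] - m[1], -m[1]), reverse=True)
    kept = []
    taken = set()  # token indices already covered by a kept match
    for label, start, end in ordered:
        if any(i in taken for i in range(start, end)):
            continue  # overlaps something we already kept
        kept.append((label, start, end))
        taken.update(range(start, end))
    # Return the surviving matches in document order
    return sorted(kept, key=lambda m: m[1])


example = [("GPE", 0, 2), ("GPE", 0, 3), ("ORG", 5, 6)]
filtered = filter_overlaps(example)
```

In the component, you could then loop over `filter_overlaps(self.matcher(doc))` instead of the raw matches before building the Span objects.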

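You can structure your pipeline component however you like. For example, you could extend it to load your list of terms from a file, or make it add multiple rules for different labels to the PhraseMatcher.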
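
As a rough sketch of the "load from a file, multiple labels" idea: the helper below groups terms by label. The file format (one `LABEL<TAB>term` pair per line, `#` for comments) and the name `load_terms` are assumptions made for illustration, not anything spaCy prescribes.

```python
from collections import defaultdict


def load_terms(lines):
    """Group terms by label from lines of the form 'LABEL<TAB>term'.

    `lines` can be any iterable of strings, e.g. an open file handle.
    """
    terms_by_label = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        label, term = line.split("\t", 1)
        terms_by_label[label.strip()].append(term.strip())
    return dict(terms_by_label)


sample_lines = [
    "# label<TAB>term",
    "",
    "ANIMAL\tred panda",
    "ANIMAL\tsnow leopard",
    "PLANT\tgiant sequoia\n",
]
terms_by_label = load_terms(sample_lines)
```

Inside the component's `__init__`, you could then iterate over `terms_by_label.items()` and call `self.matcher.add(label, None, *patterns)` once per label, using the same call as in the example above.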

Answer 2

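As of spaCy v2.1, spaCy provides an out-of-the-box solution for this: the EntityRuler class.

To match only on the entities you care about, you can either disable the entity pipeline component before adding your custom component, or instantiate an empty language model and add the custom component to it, e.g.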

from spacy.lang.en import English
from spacy.pipeline import EntityRuler

nlp = English()

my_patterns = [{"label": "ORG", "pattern": "spacy team"}]
ruler = EntityRuler(nlp)
ruler.add_patterns(my_patterns)
nlp.add_pipe(ruler)

doc = nlp("The spacy team are amazing!")
assert str(doc.ents[0]) == 'spacy team'

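If all you want to do is tag entities in your documents that exactly match a list of terms, instantiating an empty language model is probably the simplest solution.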
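
Note that the snippet above uses the spaCy v2 API, where the component object is passed to nlp.add_pipe. In spaCy v3, built-in components are added by their string name instead. A minimal sketch of the same idea, assuming spaCy v3 is installed (a blank pipeline needs no model download):

```python
import spacy

# Blank English pipeline: tokenizer only, no statistical NER
nlp = spacy.blank("en")

# In v3, add_pipe takes the component's registered name and returns the component
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "ORG", "pattern": "spacy team"}])

doc = nlp("The spacy team are amazing!")
ents = [(ent.text, ent.label_) for ent in doc.ents]
```

Because the pipeline contains nothing but the ruler, `ents` holds only the spans that match your patterns.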

Spacy entities from PhraseMatcher only
