Question

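I'm currently trying to train an NER model centered around property descriptions. I can get a fully trained model that works to my liking; however, I now want to add a retokenize pipe to the model so that I can set the model up to train other things.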

This is where I'm having trouble getting the retokenize pipe to actually work. Here is the definition:

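from spacy.attrs import intify_attrs

def retok(doc):
    # Record entity boundaries and labels before retokenizing.
    ents = [(ent.start, ent.end, ent.label) for ent in doc.ents]
    string_store = doc.vocab.strings
    with doc.retokenize() as retokenizer:
        # The merges must be registered inside the retokenize() context.
        for start, end, label in ents:
            retokenizer.merge(
                doc[start:end],
                attrs=intify_attrs({'ent_type': label}, string_store))
    return doc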

I add it to my training like this:

nlp.add_pipe(retok, after="ner")

I add it to the language factories like this:

Language.factories['retok'] = lambda nlp, **cfg: retok(nlp)

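The problem I keep running into is "AttributeError: 'English' object has no attribute 'ents'". I assume I'm getting this error because the argument passed to this function is not a doc but the nlp model itself. I'm not sure how to get a doc to flow into this pipe during training, and at this point I don't know where to go from here to make the pipe work the way I want.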

Any help is appreciated, thanks.

Answer 1

You can potentially use the built-in merge_entities pipeline component: https://spacy.io/api/pipeline-functions#merge_entities

Example copied from the docs:

texts = [t.text for t in nlp("I like David Bowie")]
assert texts == ["I", "like", "David", "Bowie"]

merge_ents = nlp.create_pipe("merge_entities")
nlp.add_pipe(merge_ents)

texts = [t.text for t in nlp("I like David Bowie")]
assert texts == ["I", "like", "David Bowie"]

If you need to customize it further, the current implementation of merge_entities (v2.2) is a good starting point:

def merge_entities(doc):
    """Merge entities into a single token.

    doc (Doc): The Doc object.
    RETURNS (Doc): The Doc object with merged entities.

    DOCS: https://spacy.io/api/pipeline-functions#merge_entities
    """
    with doc.retokenize() as retokenizer:
        for ent in doc.ents:
            attrs = {"tag": ent.root.tag, "dep": ent.root.dep, "ent_type": ent.label}
            retokenizer.merge(ent, attrs=attrs)
    return doc
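
To make concrete what merging does to the token sequence, here is a rough plain-Python sketch. It is not spaCy code: merge_spans is a hypothetical helper that works on ordinary lists, and it assumes the spans are non-overlapping (start, end, label) tuples with an exclusive end.

```python
def merge_spans(tokens, spans):
    """Collapse each (start, end, label) span into a single token.

    Hypothetical helper for illustration; assumes non-overlapping spans
    with start < end, where end is exclusive.
    """
    by_start = {}
    for start, end, label in spans:
        if not 0 <= start < end <= len(tokens):
            raise ValueError("invalid span: (%d, %d)" % (start, end))
        by_start[start] = (end, label)

    merged, labels = [], []
    i = 0
    while i < len(tokens):
        if i in by_start:
            end, label = by_start[i]
            merged.append(" ".join(tokens[i:end]))  # one token for the whole span
            labels.append(label)
            i = end
        else:
            merged.append(tokens[i])
            labels.append("")  # tokens outside any span carry no label
            i += 1
    return merged, labels


tokens = ["I", "like", "David", "Bowie"]
merged, labels = merge_spans(tokens, [(2, 4, "PERSON")])
```

In spaCy itself the retokenizer does this work on the Doc and also keeps the token attributes consistent, which is why the attrs dictionary is passed to merge().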

P.S. You are passing nlp to retok() in the line below, which is where the error comes from:

Language.factories['retok'] = lambda nlp, **cfg: retok(nlp)
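
A factory is called with the nlp object and is expected to return the component, so the likely fix is to return retok itself (lambda nlp, **cfg: retok) rather than call it. The sketch below shows the difference in plain Python. FakeDoc and FakePipeline are hypothetical stand-ins written for this illustration, not spaCy classes, and they only mimic the general factory pattern.

```python
# Hypothetical stand-ins for illustration only; these are NOT spaCy classes.
class FakeDoc:
    def __init__(self, words):
        self.words = words
        self.ents = []  # a doc has .ents; the pipeline object does not


class FakePipeline:
    """Resolves a factory by name, then runs components on each doc."""

    def __init__(self):
        self.factories = {}
        self.components = []

    def create_pipe(self, name, **cfg):
        # The factory receives the pipeline object and must return a callable.
        return self.factories[name](self, **cfg)

    def add_pipe(self, component):
        self.components.append(component)

    def __call__(self, text):
        doc = FakeDoc(text.split())
        for component in self.components:
            doc = component(doc)  # only here does a doc reach the component
        return doc


def retok(doc):
    # Reads doc.ents, so it fails when handed anything that is not a doc.
    doc.seen_ents = list(doc.ents)
    return doc


nlp = FakePipeline()

# Broken: calls retok(nlp) when the pipe is created, so retok gets the pipeline.
nlp.factories["retok_broken"] = lambda nlp, **cfg: retok(nlp)
try:
    nlp.create_pipe("retok_broken")
    broken_error = None
except AttributeError as err:
    broken_error = err

# Fixed: return the function itself; the pipeline calls it later with a doc.
nlp.factories["retok"] = lambda nlp, **cfg: retok
nlp.add_pipe(nlp.create_pipe("retok"))
doc = nlp("I like David Bowie")
```

With the broken factory the AttributeError is raised as soon as the pipe is created, before any text is processed, which matches the symptom in the question.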

See the related question: Spacy - Save custom pipeline

Adding Retokenize pipeline when training NER Model
