SpaCy custom NER model: error when training the dependency parser

Time: 2020-06-08 08:46:47

Tags: python parsing nlp spacy ner

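I am trying to build a custom NER model with spaCy. After building the model for the entities, I also need to train the dependency parser. I tried to follow the example code provided on the spaCy website: https://spacy.io/usage/training#tagger-parser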

The example training data given on the spaCy website is:

TRAIN_DATA = [
(
    "They trade mortgage-backed securities.",
    {
        "heads": [1, 1, 4, 4, 5, 1, 1],
        "deps": ["nsubj", "ROOT", "compound", "punct", "nmod", "dobj", "punct"],
    },
)]

In this example, the training data contains a key called "heads". I don't understand exactly what it means or what role it plays in the code.

I tried running the model with training data that has no "heads" key. An example of my training data:

TRAIN_PARSER = ('Mr Manjunath who is in-charge of the motor at their Goa location.', {'deps': ['compound',    'ROOT',    'nsubj',    'relcl',    'prep',    'punct',    'pobj',    'prep',    'det',    'pobj',    'prep',    'poss', 'compound','pobj', 'punct']})

When I run the following code, with no heads given in the training data:

from __future__ import unicode_literals, print_function

import plac
import random
from pathlib import Path
import spacy
from spacy.util import minibatch, compounding


# training data
TRAIN_DATA = TRAIN_PARSER


@plac.annotations(
    model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
    output_dir=("Optional output directory", "option", "o", Path),
    n_iter=("Number of training iterations", "option", "n", int),
)
def main(model='model1', output_dir='model2', n_iter=74):
    """Load the model, set up the pipeline and train the parser."""
    if model is not None:
        nlp = spacy.load(model)  # load existing spaCy model
        print("Loaded model '%s'" % model)
    else:
        nlp = spacy.blank("en")  # create blank Language class
        print("Created blank 'en' model")

    # add the parser to the pipeline if it doesn't exist
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if "parser" not in nlp.pipe_names:
        parser = nlp.create_pipe("parser")
        nlp.add_pipe(parser, first=True)
    # otherwise, get it, so we can add labels to it
    else:
        parser = nlp.get_pipe("parser")

    # add labels to the parser
    for _, annotations in TRAIN_DATA:
        for dep in annotations.get('deps', []):
            parser.add_label(dep)

    # get names of other pipes to disable them during training
    pipe_exceptions = ["parser", "trf_wordpiecer", "trf_tok2vec"]
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
    with nlp.disable_pipes(*other_pipes):  # only train parser
        optimizer = nlp.begin_training()
        for itn in range(n_iter):
            random.shuffle(TRAIN_DATA)
            losses = {}
            # batch up the examples using spaCy's minibatch
            batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(texts, annotations, sgd=optimizer, losses=losses)
            print("Losses", losses)

    # test the trained model
    test_text = "I like securities."
    doc = nlp(test_text)
    print("Dependencies", [(t.text, t.dep_, t.head.text) for t in doc])

    # save model to output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)

        # test the saved model
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        doc = nlp2(test_text)
        print("Dependencies", [(t.text, t.dep_, t.head.text) for t in doc])


main(model='model1', output_dir='model2', n_iter=74)

I get the following error:

IndexError: list index out of range

Can someone explain what exactly the problem is here and how I can fix it? Also, how do I generate the "heads" values for my training data?

1 Answer:

Answer 0 (score: 0)

The heads information is needed to identify each token's direct "parent" in the dependency tree. For example, in

"I like London and Berlin.",
        {
            "heads": [1, 1, 1, 2, 2, 1],
            "deps": ["nsubj", "ROOT", "dobj", "cc", "conj", "punct"],
        },

the head of the word I is index 1, i.e. the word like, and the two are connected by the dependency relation nsubj.
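To make the heads/deps pairing concrete, here is a minimal sketch in plain Python (no spaCy needed). The helper name `build_arcs` and the hand-written token list are assumptions for illustration only; the token list assumes spaCy would split the sentence into these six tokens. The helper also rejects annotations whose heads and deps do not have one entry per token, which is a cheap sanity check to run before training. It does not claim to pinpoint the cause of the IndexError in the question.

```python
def build_arcs(tokens, heads, deps):
    """Pair every token with its dependency label and the text of its head.

    Hypothetical helper for illustration; it is not part of spaCy.
    """
    if not (len(tokens) == len(heads) == len(deps)):
        raise ValueError(
            "tokens, heads and deps need one entry per token, got "
            f"{len(tokens)}, {len(heads)} and {len(deps)}"
        )
    arcs = []
    for i, (token, head, dep) in enumerate(zip(tokens, heads, deps)):
        # Each head is the absolute index of the parent token in the sentence.
        if not 0 <= head < len(tokens):
            raise ValueError(f"head index {head} of token {i} is out of range")
        arcs.append((token, dep, tokens[head]))
    return arcs


# Tokens written out by hand (assumption: this matches spaCy's tokenization).
tokens = ["I", "like", "London", "and", "Berlin", "."]
heads = [1, 1, 1, 2, 2, 1]
deps = ["nsubj", "ROOT", "dobj", "cc", "conj", "punct"]

arcs = build_arcs(tokens, heads, deps)
for child, dep, head in arcs:
    print(f"{child:<7} --{dep}--> {head}")
```

Note that the ROOT token (like) points at itself, and that Berlin attaches to London (index 2) rather than to the verb.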

For more on this terminology, see the spaCy documentation: https://spacy.io/usage/linguistic-features#navigating
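As for producing heads for your own sentences, one possible approach (a sketch, not an official recipe) is to read them off an already parsed Doc and then correct them by hand. The function below relies only on the `Token.head`, `Token.i` and `Token.dep_` attributes. The name `annotations_from_doc` and the stub document built with `SimpleNamespace` are assumptions made so the example runs without a model installed.

```python
from types import SimpleNamespace


def annotations_from_doc(doc):
    """Collect heads and deps in the training format from a parsed doc.

    `doc` is any iterable of token-like objects exposing `.head.i` and `.dep_`,
    such as a spaCy Doc produced by a pipeline that includes a parser.
    """
    return {
        "heads": [token.head.i for token in doc],
        "deps": [token.dep_ for token in doc],
    }


def make_stub_doc(words, heads, deps):
    """Build stand-in tokens that mimic the attributes used above (illustration only)."""
    stub_tokens = [
        SimpleNamespace(text=word, i=i, dep_=dep)
        for i, (word, dep) in enumerate(zip(words, deps))
    ]
    for token, head in zip(stub_tokens, heads):
        token.head = stub_tokens[head]
    return stub_tokens


stub_doc = make_stub_doc(
    ["I", "like", "London", "and", "Berlin", "."],
    [1, 1, 1, 2, 2, 1],
    ["nsubj", "ROOT", "dobj", "cc", "conj", "punct"],
)
annotations = annotations_from_doc(stub_doc)
print(annotations)
```

With a real pipeline you would pass `nlp(text)` instead of the stub, where `nlp` is a loaded model that already has a parser (for example a pretrained English model, assuming one is installed). The predicted heads and labels can be wrong, so they should be reviewed before being used as training data.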