I'm training a new entity type starting from the pretrained German model. To avoid catastrophic forgetting, I tried two approaches.
Approach 1: pseudo-rehearsal. As suggested in the spaCy docs, I tried the pseudo-rehearsal approach and fed the update function the training data plus revision data (data annotated with the standard entities the model already recognizes). It looks like after the first update iteration the model forgets the standard entities and only recognizes the new entity. Do you have any idea why this happens?
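Concretely, the mixing step I mean looks roughly like this (a minimal plain-Python sketch; `mix_batches` is my own helper, not a spaCy API):

```python
import random

def mix_batches(train_data, revision_data, ratio=2):
    """Pseudo-rehearsal: pad the new-entity examples with up to `ratio`
    times as many revision examples (texts annotated with the standard
    entities), so the old labels keep getting a training signal."""
    n_revision = min(len(revision_data), ratio * len(train_data))
    mixed = list(train_data) + random.sample(revision_data, n_revision)
    random.shuffle(mixed)
    return mixed

# toy data in (text, entity-offsets) form
train_data = [("Gertrude mag WunderKatze", [(13, 24, "NEW_LABEL")])]
revision_data = [("Paolo wohnt in Berlin", [(15, 21, "LOC")]),
                 ("Anastasia mag Nudeln", [(0, 9, "PER")])]
mixed = mix_batches(train_data, revision_data, ratio=2)
print(len(mixed))  # 3: the one new example plus two revision examples
```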
Approach 2: the rehearse function. As a second approach, I tried the rehearse function, passing texts containing the standard entities (without annotations) as rehearsal data. The texts look like this:
[Gertrude liebt Katze, Paolo hat gerne Katzen, Anastasia hat gerne Nudeln]
When calling the rehearse function, I get the following error:
File "test.py", line 226, in <module>
new_model = train_ner(nlp, TRAIN_DATA,rehearse_data, n_iter, LABEL, output_dir, model_name)
File "test.py", line 112, in train_ner
nlp.rehearse(set(raw_batch), sgd=optimizer, losses=r_losses)
File "/usr/local/lib/python3.6/dist-packages/spacy/language.py", line 503, in rehearse
proc.rehearse(docs, sgd=get_grads, losses=losses, **config.get(name, {}))
File "pipes.pyx", line 450, in spacy.pipeline.pipes.Tagger.rehearse
TypeError: unsupported operand type(s) for -: 'list' and 'list'
I checked the rehearse function in pipes.pyx, where the gradient is computed as:
gradient = scores - targets
Of course a list can't be subtracted from a list, but I suspect I'm simply not providing the rehearsal data in the form the function expects. I also tried providing the rehearsal data with labels, in the format spaCy usually likes:
rehearse_data = [
("Uber blew through $1 million a week", [(0, 4, 'ORG')]),
("Android Pay expands to Canada", [(0, 11, 'PRODUCT'), (23, 30, 'GPE')]),
("Spotify steps up Asia expansion", [(0, 8, "ORG"), (17, 21, "LOC")]),
("Google Maps launches location sharing", [(0, 11, "PRODUCT")]),
("Google rebrands its business apps", [(0, 6, "ORG")]),
("look what i found on google! ", [(21, 27, "PRODUCT")])]
and got the following error:
File "test.py", line 228, in <module>
new_model = train_ner(nlp, TRAIN_DATA,rehearse_data, n_iter, LABEL, output_dir, model_name)
File "test.py", line 109, in train_ner
raw_batch = [nlp.make_doc(text) for text in list(next(r_batches))]
File "test.py", line 109, in <listcomp>
raw_batch = [nlp.make_doc(text) for text in list(next(r_batches))]
File "/usr/local/lib/python3.6/dist-packages/spacy/language.py", line 401, in make_doc
return self.tokenizer(text)
TypeError: Argument 'string' has incorrect type (expected str, got tuple)
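If I read this second error correctly, iterating over a minibatch of rehearse_data yields (text, annotations) tuples, so make_doc receives a tuple instead of a string. A minimal plain-Python sketch of unpacking the raw texts first (the commented line stands in for the spaCy call):

```python
rehearse_data = [
    ("Uber blew through $1 million a week", [(0, 4, "ORG")]),
    ("Android Pay expands to Canada", [(0, 11, "PRODUCT"), (23, 30, "GPE")]),
]

# each batch item is a (text, annotations) tuple, so unpack before tokenizing
for batch in [rehearse_data]:  # stand-in for minibatch(rehearse_data, size=4)
    texts = [text for text, _annotations in batch]
    # raw_batch = [nlp.make_doc(text) for text in texts]  # now gets str, not tuple
    print(texts[0])
```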
Here is my code:
def train_ner(nlp, train_data, rehearse_data, niter, label, output_dir, model_name):
    random.seed(0)
    # Add the new entity label to the existing entity recognizer
    ner = nlp.get_pipe("ner")
    ner.add_label(label)
    # Resume training: we want to train only the new entity
    optimizer = nlp.resume_training()
    sizes = compounding(1.0, 4.0, 1.001)
    for itn in range(niter):
        print("NITER", itn)
        random.shuffle(train_data)
        random.shuffle(rehearse_data)
        losses = {}
        r_losses = {}
        r_batches = minibatch(rehearse_data, size=4)
        # batch up the examples using spaCy's minibatch
        for batch in minibatch(train_data, size=4):
            texts, annotations = zip(*batch)
            # `dropout` is an iterator of dropout rates defined elsewhere
            nlp.update(texts, annotations, sgd=optimizer, drop=next(dropout), losses=losses)
            raw_batch = [nlp.make_doc(text) for text in list(next(r_batches))]
            nlp.rehearse(set(raw_batch), sgd=optimizer, losses=r_losses)
        print("Losses", losses)
        print("R. Losses", r_losses)
    print(nlp.get_pipe('ner').model.unseen_classes)