Catastrophic forgetting in spaCy: the rehearse function

Date: 2019-05-09 06:44:10

Tags: spacy

I started from the pre-trained German model and trained new entities on top of it. To avoid catastrophic forgetting, I tried two approaches.

Approach 1: pseudo-rehearsal

As suggested in the spaCy documentation, I tried the pseudo-rehearsal approach and passed the training data plus revision data (data containing the standard entities the model already recognizes) to the update function. It seems that after the first update iteration the model forgets the standard entities and only recognizes the new ones. Do you know why this happens?
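For context, the core of pseudo-rehearsal is simply interleaving revision examples (texts annotated with the base model's own predictions) into every training batch, so the model keeps seeing its old labels. A minimal data-mixing sketch in plain Python; the example texts and labels here are hypothetical, not from my data:

```python
import random

# Hypothetical new-entity training example
train_data = [("Die Hyperschleife ist kaputt", [(4, 17, "GADGET")])]

# Hypothetical revision example, annotated with the base model's own predictions
revision_data = [("Angela Merkel besucht Berlin", [(0, 13, "PER"), (22, 28, "LOC")])]

def mixed_batches(train_data, revision_data, batch_size=4):
    """Yield batches that interleave new examples with revision examples."""
    mixed = train_data + revision_data
    random.shuffle(mixed)
    for i in range(0, len(mixed), batch_size):
        yield mixed[i:i + batch_size]

batches = list(mixed_batches(train_data, revision_data))
```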

Approach 2: the rehearse function

As a second approach, I tried the rehearse function, passing texts containing standard entities (without annotations) as rehearsal data. The texts look like this:

["Gertrude liebt Katze", "Paolo hat gerne Katzen", "Anastasia hat gerne Nudeln"]

When I call the rehearse function, I get the following error:

  File "test.py", line 226, in <module>
    new_model = train_ner(nlp, TRAIN_DATA, rehearse_data, n_iter, LABEL, output_dir, model_name)
  File "test.py", line 112, in train_ner
    nlp.rehearse(set(raw_batch), sgd=optimizer, losses=r_losses)
  File "/usr/local/lib/python3.6/dist-packages/spacy/language.py", line 503, in rehearse
    proc.rehearse(docs, sgd=get_grads, losses=losses, **config.get(name, {}))
  File "pipes.pyx", line 450, in spacy.pipeline.pipes.Tagger.rehearse
TypeError: unsupported operand type(s) for -: 'list' and 'list'
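The TypeError itself reproduces in plain Python: `-` is not defined for two lists, whereas the elementwise subtraction a gradient computation needs only works on numeric arrays (or done by hand, as below; the values are illustrative only):

```python
scores = [0.2, 0.7, 0.1]
targets = [0.0, 1.0, 0.0]

try:
    gradient = scores - targets  # list - list is not defined
except TypeError as err:
    message = str(err)  # "unsupported operand type(s) for -: 'list' and 'list'"

# The elementwise subtraction the gradient computation needs, done by hand
gradient = [s - t for s, t in zip(scores, targets)]
```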

I looked at the rehearse function in pipes.pyx, where the gradient is computed as:

    gradient = scores - targets

Of course a list can't be subtracted from a list, but I suspect I'm not passing the rehearsal data to the function in the right form. I also tried passing labeled rehearsal data in the form spaCy likes:

rehearse_data = [
    ("Uber blew through $1 million a week", [(0, 4, "ORG")]),
    ("Android Pay expands to Canada", [(0, 11, "PRODUCT"), (23, 30, "GPE")]),
    ("Spotify steps up Asia expansion", [(0, 8, "ORG"), (17, 21, "LOC")]),
    ("Google Maps launches location sharing", [(0, 11, "PRODUCT")]),
    ("Google rebrands its business apps", [(0, 6, "ORG")]),
    ("look what i found on google! ", [(21, 27, "PRODUCT")]),
]

and got the following error:

  File "test.py", line 228, in <module>
    new_model = train_ner(nlp, TRAIN_DATA, rehearse_data, n_iter, LABEL, output_dir, model_name)
  File "test.py", line 109, in train_ner
    raw_batch = [nlp.make_doc(text) for text in list(next(r_batches))]
  File "test.py", line 109, in <listcomp>
    raw_batch = [nlp.make_doc(text) for text in list(next(r_batches))]
  File "/usr/local/lib/python3.6/dist-packages/spacy/language.py", line 401, in make_doc
    return self.tokenizer(text)
TypeError: Argument 'string' has incorrect type (expected str, got tuple)
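This second error is consistent with minibatch yielding `(text, annotations)` tuples: iterating the batch directly hands whole tuples to `nlp.make_doc`, which expects a string. Unpacking the pairs first separates the texts out; a sketch using two of the rehearsal examples above:

```python
# A rehearsal batch as minibatch would yield it: (text, annotations) pairs
batch = [
    ("Uber blew through $1 million a week", [(0, 4, "ORG")]),
    ("Google rebrands its business apps", [(0, 6, "ORG")]),
]

# zip(*batch) separates the pairs into parallel tuples, so texts
# contains only plain strings suitable for nlp.make_doc(text)
texts, annotations = zip(*batch)
```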

Here is my code:

import random

from spacy.util import minibatch, decaying


def train_ner(nlp, train_data, rehearse_data, n_iter, label, output_dir, model_name):
    random.seed(0)

    # Get the entity recognizer and add the new entity label
    ner = nlp.get_pipe("ner")
    ner.add_label(label)

    # Resume training: we want to train only the new entity
    optimizer = nlp.resume_training()

    # Decaying dropout schedule for nlp.update
    dropout = decaying(0.35, 0.2, 1e-4)

    for itn in range(n_iter):
        print("NITER", itn)
        random.shuffle(train_data)
        random.shuffle(rehearse_data)

        losses = {}
        r_losses = {}

        r_batches = minibatch(rehearse_data, size=4)

        # Batch up the examples using spaCy's minibatch
        for batch in minibatch(train_data, size=4):
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, drop=next(dropout), losses=losses)

            raw_batch = [nlp.make_doc(text) for text in list(next(r_batches))]
            nlp.rehearse(set(raw_batch), sgd=optimizer, losses=r_losses)

        print("Losses", losses)
        print("R. Losses", r_losses)

    print(nlp.get_pipe("ner").model.unseen_classes)
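A further format note: in spaCy v2 the training APIs usually expect per-text annotations as a dict like `{"entities": [...]}` rather than a bare offset list, so the labeled rehearsal data may need wrapping; whether this form also satisfies rehearse is an assumption on my part. A pure-Python conversion sketch:

```python
rehearse_data = [
    ("Uber blew through $1 million a week", [(0, 4, "ORG")]),
    ("Google rebrands its business apps", [(0, 6, "ORG")]),
]

# Wrap each bare offset list in the {"entities": [...]} dict that
# spaCy v2's update expects as annotations
converted = [(text, {"entities": ents}) for text, ents in rehearse_data]
```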
