I'm training a new entity type starting from the pretrained German model. To avoid catastrophic forgetting, I tried two approaches.
Approach 1: pseudo-rehearsal. As suggested in the spaCy docs, I tried the pseudo-rehearsal approach and fed the update function the training data plus revision data (data annotated with the standard entities the model already recognizes). It looks like after the first update iteration the model forgets the standard entities and only recognizes the new entity. Do you have any idea why this happens?
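Concretely, the mixing step I mean looks roughly like this (a minimal plain-Python sketch; `mix_batches` is my own helper, not a spaCy API):

```python
import random

def mix_batches(train_data, revision_data, ratio=2):
    """Pseudo-rehearsal: pad the new-entity examples with up to `ratio`
    times as many revision examples (texts annotated with the standard
    entities), so the old labels keep getting a training signal."""
    n_revision = min(len(revision_data), ratio * len(train_data))
    mixed = list(train_data) + random.sample(revision_data, n_revision)
    random.shuffle(mixed)
    return mixed

# toy data in (text, entity-offsets) form
train_data = [("Gertrude mag WunderKatze", [(13, 24, "NEW_LABEL")])]
revision_data = [("Paolo wohnt in Berlin", [(15, 21, "LOC")]),
                 ("Anastasia mag Nudeln", [(0, 9, "PER")])]
mixed = mix_batches(train_data, revision_data, ratio=2)
print(len(mixed))  # 3: the one new example plus two revision examples
```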
Approach 2: the rehearse function. As a second approach, I tried the rehearse function, passing texts containing the standard entities (without annotations) as rehearsal data. The texts look like this:
[Gertrude liebt Katze, Paolo hat gerne Katzen, Anastasia hat gerne Nudeln]
When calling the rehearse function, I get the following error:
File "test.py", line 226, in <module>
new_model = train_ner(nlp, TRAIN_DATA,rehearse_data, n_iter, LABEL, output_dir, model_name)
File "test.py", line 112, in train_ner
nlp.rehearse(set(raw_batch), sgd=optimizer, losses=r_losses)
File "/usr/local/lib/python3.6/dist-packages/spacy/language.py", line 503, in rehearse
proc.rehearse(docs, sgd=get_grads, losses=losses, **config.get(name, {}))
File "pipes.pyx", line 450, in spacy.pipeline.pipes.Tagger.rehearse
TypeError: unsupported operand type(s) for -: 'list' and 'list'
I checked the rehearse function in pipes.pyx, where the gradient is computed as:
gradient = scores - targets
Of course a list can't be subtracted from a list, but I suspect I'm simply not providing the rehearsal data in the form the function expects. I also tried providing the rehearsal data with labels, in the format spaCy usually likes:
rehearse_data = [
("Uber blew through $1 million a week", [(0, 4, 'ORG')]),
("Android Pay expands to Canada", [(0, 11, 'PRODUCT'), (23, 30, 'GPE')]),
("Spotify steps up Asia expansion", [(0, 8, "ORG"), (17, 21, "LOC")]),
("Google Maps launches location sharing", [(0, 11, "PRODUCT")]),
("Google rebrands its business apps", [(0, 6, "ORG")]),
("look what i found on google! ", [(21, 27, "PRODUCT")])]
and got the following error:
File "test.py", line 228, in <module>
new_model = train_ner(nlp, TRAIN_DATA,rehearse_data, n_iter, LABEL, output_dir, model_name)
File "test.py", line 109, in train_ner
raw_batch = [nlp.make_doc(text) for text in list(next(r_batches))]
File "test.py", line 109, in <listcomp>
raw_batch = [nlp.make_doc(text) for text in list(next(r_batches))]
File "/usr/local/lib/python3.6/dist-packages/spacy/language.py", line 401, in make_doc
return self.tokenizer(text)
TypeError: Argument 'string' has incorrect type (expected str, got tuple)
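If I read this second error correctly, iterating over a minibatch of rehearse_data yields (text, annotations) tuples, so make_doc receives a tuple instead of a string. A minimal plain-Python sketch of unpacking the raw texts first (the commented line stands in for the spaCy call):

```python
rehearse_data = [
    ("Uber blew through $1 million a week", [(0, 4, "ORG")]),
    ("Android Pay expands to Canada", [(0, 11, "PRODUCT"), (23, 30, "GPE")]),
]

# each batch item is a (text, annotations) tuple, so unpack before tokenizing
for batch in [rehearse_data]:  # stand-in for minibatch(rehearse_data, size=4)
    texts = [text for text, _annotations in batch]
    # raw_batch = [nlp.make_doc(text) for text in texts]  # now gets str, not tuple
    print(texts[0])
```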
Here is my code:
def train_ner(nlp, train_data, rehearse_data, niter, label, output_dir, model_name):
    random.seed(0)
    # Add the new entity label to the existing entity recognizer
    ner = nlp.get_pipe("ner")
    ner.add_label(label)
    # Resume training: we want to train only the new entity
    optimizer = nlp.resume_training()
    sizes = compounding(1.0, 4.0, 1.001)
    for itn in range(niter):
        print("NITER", itn)
        random.shuffle(train_data)
        random.shuffle(rehearse_data)
        losses = {}
        r_losses = {}
        r_batches = minibatch(rehearse_data, size=4)
        # batch up the examples using spaCy's minibatch
        for batch in minibatch(train_data, size=4):
            texts, annotations = zip(*batch)
            # `dropout` is an iterator of dropout rates defined elsewhere
            nlp.update(texts, annotations, sgd=optimizer, drop=next(dropout), losses=losses)
            raw_batch = [nlp.make_doc(text) for text in list(next(r_batches))]
            nlp.rehearse(set(raw_batch), sgd=optimizer, losses=r_losses)
        print("Losses", losses)
        print("R. Losses", r_losses)
    print(nlp.get_pipe('ner').model.unseen_classes)