I am using a trained spaCy model to recognize named entities in text. In Python, I have a dataset that looks like this example:
TRAIN_DATA = [
    ('Estado de Mato Grosso do Sul', {
        'entities': [(0, 28, 'LOC')]
    }),
    ('Poder Judiciario', {
        'entities': [(0, 16, 'ORG')]
    }),
    ('Campo Grande', {
        'entities': [(0, 12, 'LOC')]
    }),
    ('Exequente: Fundo de Investimento em Direitos Creditérios Multsegmentos NPL', {
        'entities': [(11, 74, 'MISC')]
    }),
    ('Ipanema VI - Nao Padronizado', {
        'entities': [(0, 10, 'LOC')]
    }),
    ...
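The dataset continues in the same format. Before training, I also run a quick sanity check that each (start, end) offset really covers the substring I meant to annotate (this is just a check, not part of the training script):

    for text, annotations in TRAIN_DATA:
        for start, end, label in annotations['entities']:
            # print the label next to the exact character span it points at
            print(label, repr(text[start:end]))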
After training it like this:
# add labels
for _, annotations in TRAIN_DATA:
    for ent in annotations.get('entities'):
        ner.add_label(ent[2])

# get names of other pipes to disable them during training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
with nlp.disable_pipes(*other_pipes):  # only train NER
    optimizer = nlp.begin_training()
    for itn in range(n_iter):
        random.shuffle(TRAIN_DATA)
        losses = {}
        for text, annotations in tqdm(TRAIN_DATA):
            nlp.update(
                [text],  # batch of texts
                [annotations],  # batch of annotations
                drop=0.5,  # dropout - make it harder to memorise data
                sgd=optimizer,  # callable to update weights
                losses=losses)
        print(losses)

# test the trained model
for text, _ in TRAIN_DATA:
    doc = nlp(text)
    print('Entities', [(ent.text, ent.label_) for ent in doc.ents])
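For completeness: I understand that the pipeline trained above can be saved with nlp.to_disk and loaded back with spacy.load in a later session, which is relevant to my question below. A minimal sketch, where './ner_model' is just a placeholder path:

    import spacy

    # persist the pipeline trained above (placeholder path)
    nlp.to_disk('./ner_model')

    # in a later session, reload it and run it on text
    nlp_loaded = spacy.load('./ner_model')
    doc = nlp_loaded('Campo Grande')
    print([(ent.text, ent.label_) for ent in doc.ents])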
I assume that training and testing are being done on the same dataset. That is not ideal; I would like to split my dataset into a training set and a test set. Alternatively, I could create a separate dataset used only for testing, but then how would I apply the same previously trained model to it? How can I do either of these?
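To make the first option concrete, this is roughly the kind of split I have in mind (a sketch only; the 80/20 ratio is arbitrary, and I am not sure how the held-out part should then be scored):

    import random

    random.shuffle(TRAIN_DATA)
    split = int(len(TRAIN_DATA) * 0.8)   # arbitrary 80/20 split
    train_set = TRAIN_DATA[:split]       # passed to nlp.update(...) as above
    test_set = TRAIN_DATA[split:]        # held out, never shown during training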