Training a custom NER model

Time: 2019-12-03 06:44:41

Tags: python machine-learning nltk spacy ner

I have been training my NER model on some text, trying to find cities in it using custom entities.

Example:

    ('paragraph Designated Offices Party A New York Party B Delaware paragraph pricing source calculation Market Value shall generally accepted pricing source reasonably agreed parties paragraph Spot rate Spot Rate specified paragraph reasonably agreed parties',
     {'entities': [(37, 41, 'DesignatedBankLoc'), (54, 62, 'CounterpartyBankLoc')]})

I am looking for two entities, DesignatedBankLoc and CounterpartyBankLoc. A single text can also contain multiple entities.

Currently I am training on 60 rows of data as follows:

    import random

    import spacy


    def train_spacy(data, iterations):
        """Train a blank English pipeline with only an NER component (spaCy v2 API)."""
        TRAIN_DATA = data
        nlp = spacy.blank('en')  # create blank Language class
        # create the built-in NER component and add it to the pipeline;
        # nlp.create_pipe works for built-ins that are registered with spaCy
        if 'ner' not in nlp.pipe_names:
            ner = nlp.create_pipe('ner')
            nlp.add_pipe(ner, last=True)
        else:
            ner = nlp.get_pipe('ner')

        # add every label that occurs in the annotations
        for _, annotations in TRAIN_DATA:
            for ent in annotations.get('entities'):
                ner.add_label(ent[2])

        # get names of other pipes to disable them during training
        other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
        with nlp.disable_pipes(*other_pipes):  # only train NER
            optimizer = nlp.begin_training()
            for itn in range(iterations):
                print("Starting iteration " + str(itn))
                random.shuffle(TRAIN_DATA)
                losses = {}
                for text, annotations in TRAIN_DATA:
                    nlp.update(
                        [text],  # batch of texts
                        [annotations],  # batch of annotations
                        drop=0.5,  # dropout - make it harder to memorise data
                        sgd=optimizer,  # callable to update weights
                        losses=losses)
                print(losses)
        return nlp


    # TRAIN_DATA is the list of (text, {'entities': [...]}) tuples shown above
    prdnlp = train_spacy(TRAIN_DATA, 100)

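My problem is:

When the input text, whether it follows the same pattern as the training data or a different one, contains a city that appeared in training, the model predicts correctly. When the text, again in the same or a different pattern, contains a city that never occurred in the training dataset, the model does not predict any entity.

Please advise why this is happening, and help me understand the concept of training.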

1 answer:

Answer 0: (score: 1)

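From experience: you have 60 rows of data and trained for 100 iterations. You are overfitting on the values of the entities rather than on their positions, so the model has memorised the city names it saw instead of learning the context they appear in.

To check this, try inserting the city names at arbitrary positions in a sentence and see what happens. If the algorithm still tags them, you are probably overfitting.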
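The check above could be scripted roughly as follows. This is only a sketch: `insert_city` and `probe_memorisation` are hypothetical helper names, and `predict` is assumed to be any function that maps a text to `(start, end, label)` tuples, so the helpers themselves do not depend on spaCy.

```python
import random


def insert_city(text, city, rng=random):
    """Insert `city` at a random word boundary.

    Returns (new_text, start, end) where new_text[start:end] == city.
    Whitespace in `text` is normalised to single spaces.
    """
    words = text.split()
    position = rng.randint(0, len(words))  # 0 = before first word, len = after last
    words.insert(position, city)
    new_text = " ".join(words)
    prefix = " ".join(words[:position])
    start = len(prefix) + (1 if position > 0 else 0)  # +1 for the joining space
    return new_text, start, start + len(city)


def probe_memorisation(predict, sentences, cities, rng=random):
    """Fraction of randomly placed cities that `predict` still tags.

    `predict(text)` is an assumed callable returning (start, end, label) tuples.
    A value close to 1.0 suggests the model keys on the city strings themselves.
    """
    hits = 0
    total = 0
    for sentence in sentences:
        for city in cities:
            text, start, end = insert_city(sentence, city, rng)
            total += 1
            # count a hit if any predicted span overlaps the inserted city
            if any(s < end and e > start for s, e, _ in predict(text)):
                hits += 1
    return hits / total if total else 0.0


# With the model from the question, `predict` could be wrapped like this:
# predict = lambda text: [(ent.start_char, ent.end_char, ent.label_)
#                         for ent in prdnlp(text).ents]
```

Run it with the cities from your training data on sentences unrelated to the "Party A / Party B" pattern. A high fraction would be consistent with the model having memorised the names.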

There are two solutions:

  • Create more training data with different values for these entities
  • Test different numbers of iterations
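For the first suggestion, one way to get more varied values without labelling new text by hand is to swap other city names into the examples you already have and recompute the offsets. A minimal sketch, assuming the same `(text, {'entities': [...]})` format as in the question and non-overlapping spans; `swap_entity_values` and the replacement lists are hypothetical, not part of spaCy:

```python
import random


def swap_entity_values(example, replacements, rng=random):
    """Return a copy of one training example with each labelled span
    replaced by a randomly chosen value for its label.

    `example` is (text, {'entities': [(start, end, label), ...]}).
    `replacements` maps a label to a list of candidate strings; labels
    without candidates keep their original text. Spans must not overlap.
    """
    text, annotations = example
    parts = []
    new_entities = []
    cursor = 0   # position in the original text
    new_len = 0  # length of the new text built so far
    for start, end, label in sorted(annotations["entities"]):
        # copy the untouched text in front of this entity
        parts.append(text[cursor:start])
        new_len += start - cursor
        # pick a substitute value and record its new offsets
        value = rng.choice(replacements.get(label, [text[start:end]]))
        parts.append(value)
        new_entities.append((new_len, new_len + len(value), label))
        new_len += len(value)
        cursor = end
    parts.append(text[cursor:])  # trailing text after the last entity
    return "".join(parts), {"entities": new_entities}


# Illustrative data only; the city lists are made up for this example.
example = ("Party A New York Party B Delaware end",
           {"entities": [(8, 16, "DesignatedBankLoc"),
                         (25, 33, "CounterpartyBankLoc")]})
replacements = {"DesignatedBankLoc": ["London", "Singapore", "Frankfurt"],
                "CounterpartyBankLoc": ["Tokyo", "Zurich", "Hong Kong"]}
augmented = [swap_entity_values(example, replacements) for _ in range(5)]
```

Each augmented tuple can be appended to the training list. For the second suggestion, it probably makes sense to compare iteration counts on held-out examples whose cities do not appear in the training set, since that is the case that currently fails.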