I have been training my NER model on some text and am trying to use custom entities to find the cities in it.
Example:
('paragraph Designated Offices Party A New York Party B Delaware paragraph pricing source calculation Market Value shall generally accepted pricing source reasonably agreed parties paragraph Spot rate Spot Rate specified paragraph reasonably agreed parties',
{'entities': [(37, 45, 'DesignatedBankLoc'),(54, 62, 'CounterpartyBankLoc')]})
I am looking for two entities, DesignatedBankLoc and CounterpartyBankLoc. A single text can also contain multiple entities.
Currently I am training on 60 rows of data as follows:
import random

import spacy


def train_spacy(data, iterations):
    TRAIN_DATA = data
    nlp = spacy.blank('en')  # create blank Language class
    # create the built-in pipeline components and add them to the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner, last=True)
    else:
        ner = nlp.get_pipe('ner')  # reuse the existing component

    # add labels
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get('entities'):
            # print(ent[2])
            ner.add_label(ent[2])

    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):  # only train NER
        optimizer = nlp.begin_training()
        for itn in range(iterations):
            print("Starting iteration " + str(itn))
            random.shuffle(TRAIN_DATA)
            losses = {}
            for text, annotations in TRAIN_DATA:
                nlp.update(
                    [text],  # batch of texts
                    [annotations],  # batch of annotations
                    drop=0.5,  # dropout - make it harder to memorise data
                    sgd=optimizer,  # callable to update weights
                    losses=losses)
            print(losses)
    return nlp


prdnlp = train_spacy(TRAIN_DATA, 100)
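My problem is:
When the input contains a city that appeared in the training data, the model predicts correctly, whether the text pattern is the same or different. When the input contains a city that never occurred in the training dataset, the model fails to predict any entity, again regardless of whether the text pattern is the same or different.
Please suggest why this is happening, and help me understand the concept behind training.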
Answer 0 (score: 1)
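Speaking from experience: you have 60 rows of data and trained for 100 iterations. You are overfitting on the values of the entities (the specific city names) rather than on their position in the text.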
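One rough way to see how little variety the model has to learn from is to count the distinct entity strings in the training data. The sketch below assumes the `(text, {'entities': [(start, end, label), ...]})` format from the question; the helper name `entity_value_counts` and the two sample rows are made up for illustration.

```python
from collections import Counter


def entity_value_counts(train_data):
    """Count how often each (label, entity text) pair occurs in the data."""
    counts = Counter()
    for text, annotations in train_data:
        for start, end, label in annotations.get("entities", []):
            # Offsets are character positions; end is exclusive.
            counts[(label, text[start:end])] += 1
    return counts


# Made-up sample rows in the same format as the question's TRAIN_DATA.
sample = [
    ("Party A New York Party B Delaware",
     {"entities": [(8, 16, "DesignatedBankLoc"), (25, 33, "CounterpartyBankLoc")]}),
    ("Party A Delaware Party B New York",
     {"entities": [(8, 16, "DesignatedBankLoc"), (25, 33, "CounterpartyBankLoc")]}),
]

for (label, value), n in sorted(entity_value_counts(sample).items()):
    print(label, repr(value), n)
```

If only a handful of distinct city names show up across all 60 rows, 100 passes over them give the model plenty of opportunity to memorise those names.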
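To check whether the model is overfitting, try inserting city names at arbitrary positions in a sentence and see what happens. If the algorithm still tags them, you are probably overfitting.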
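The check described above could be sketched roughly as follows. The function names and the `predict` callable are assumptions made for this example, not part of spaCy; `predict` stands for any function that maps a text to a list of `(entity_text, label)` pairs.

```python
import random


def insert_at_random_positions(sentence, city, n_variants=5, seed=0):
    """Return copies of `sentence` with `city` inserted at random token positions."""
    rng = random.Random(seed)
    tokens = sentence.split()
    variants = []
    for _ in range(n_variants):
        # randint is inclusive, so len(tokens) means "append at the end".
        pos = rng.randint(0, len(tokens))
        variants.append(" ".join(tokens[:pos] + [city] + tokens[pos:]))
    return variants


def fraction_tagged(variants, city, predict):
    """Fraction of variants in which `predict` returns `city` as an entity text."""
    hits = sum(
        any(ent_text == city for ent_text, _label in predict(variant))
        for variant in variants
    )
    return hits / len(variants)


# Hypothetical wrapper around the trained model from the question:
# predict = lambda text: [(ent.text, ent.label_) for ent in prdnlp(text).ents]
```

With the wrapper in the final comment, a fraction close to 1.0 for a city from the training data, wherever it is placed, would suggest the model has memorised the name rather than learned the surrounding context.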
There are two solutions: