While training a spaCy NER model on custom training data, I get the following error.
ValueError: [E024] Could not find an optimal move to supervise the parser. Usually, this means the GoldParse was not correct. For example, are all labels added to the model?
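Can anyone help me with this?
Answer 0 (score: 1)
Passing the training data through the function below before training made it work without any errors.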
import re


def trim_entity_spans(data: list) -> list:
    """Removes leading and trailing white spaces from entity spans.

    Args:
        data (list): The data to be cleaned in spaCy JSON format.

    Returns:
        list: The cleaned data.
    """
    invalid_span_tokens = re.compile(r'\s')

    cleaned_data = []
    for text, annotations in data:
        entities = annotations['entities']
        valid_entities = []
        for start, end, label in entities:
            valid_start = start
            valid_end = end
            # Move the start offset right past any leading whitespace.
            while valid_start < len(text) and invalid_span_tokens.match(
                    text[valid_start]):
                valid_start += 1
            # Move the end offset left past any trailing whitespace,
            # but never past the (already trimmed) start offset.
            while valid_end > valid_start and invalid_span_tokens.match(
                    text[valid_end - 1]):
                valid_end -= 1
            valid_entities.append([valid_start, valid_end, label])
        cleaned_data.append([text, {'entities': valid_entities}])

    return cleaned_data
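If you want to see which rows are affected before cleaning them, a small check like the one below may help. It is only a sketch: `find_whitespace_spans` is a hypothetical helper name, and it assumes, as this answer suggests, that entity spans starting or ending with whitespace are what trigger the error. It expects the same `(text, {'entities': [...]})` format as above.

```python
def find_whitespace_spans(data):
    """Return (row_index, start, end, label) for spans that begin or end with whitespace."""
    problems = []
    for row, (text, annotations) in enumerate(data):
        for start, end, label in annotations.get('entities', []):
            span = text[start:end]
            # A span with leading/trailing whitespace changes when stripped.
            if span != span.strip():
                problems.append((row, start, end, label))
    return problems


# Made-up sample data for illustration only.
sample = [
    ("I live in Berlin today", {'entities': [(10, 17, 'GPE')]}),  # "Berlin " has a trailing space
    ("Paris is lovely", {'entities': [(0, 5, 'GPE')]}),           # clean span
]
print(find_whitespace_spans(sample))
```

Any row reported here is one whose offsets the trimming function would change.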
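Answer 1 (score: 1)
This happens when some of the content (data) in your annotations is empty. Examples of empty data include tags, labels, and the start and end offsets of a label. The solution provided above should work for trimming/cleaning the data. However, if you want a brute-force approach, just wrap the model update in an exception handler, as follows: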
import random

import spacy  # this code uses the spaCy v2.x training API


def train_spacy(data, iterations):
    nlp = spacy.blank('en')  # create blank Language class
    # create the built-in pipeline components and add them to the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner, last=True)
    else:
        ner = nlp.get_pipe('ner')

    # add labels
    for _, annotations in data:
        for ent in annotations.get('entities', []):
            ner.add_label(ent[2])

    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):  # only train NER
        optimizer = nlp.begin_training()
        for itn in range(iterations):
            print("Starting iteration " + str(itn))
            random.shuffle(data)
            losses = {}
            for text, annotations in data:
                try:
                    nlp.update(
                        [text],         # batch of texts
                        [annotations],  # batch of annotations
                        drop=0.2,       # dropout rate
                        sgd=optimizer,
                        losses=losses)
                except Exception as error:
                    # skip any row that raises (e.g. E024) and keep training
                    print(error)
                    continue
            print(losses)
    return nlp
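So, suppose your training data (TRAIN_DATA) contains 1000 rows and only row 200 has empty data: instead of throwing an error, the training loop will skip row 200 every time and train on the remaining rows.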
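Rather than catching the exception on every iteration, you could also filter out the bad rows once, before training. The sketch below is an assumption about what "empty data" means in practice (an empty text, an empty label, a missing offset, or an empty or out-of-range span); `is_valid_entity` and `split_valid_rows` are hypothetical helper names, not part of spaCy. Rows with no entities at all are kept.

```python
def is_valid_entity(text, ent):
    """Check that an entity is a (start, end, label) triple with a non-empty span and label."""
    if len(ent) != 3:
        return False
    start, end, label = ent
    if not isinstance(start, int) or not isinstance(end, int):
        return False
    return bool(label) and 0 <= start < end <= len(text)


def split_valid_rows(data):
    """Return (valid_rows, skipped_row_indices) for spaCy-style training data."""
    valid, skipped = [], []
    for row, (text, annotations) in enumerate(data):
        entities = annotations.get('entities') or []
        if text and all(is_valid_entity(text, ent) for ent in entities):
            valid.append((text, annotations))
        else:
            skipped.append(row)
    return valid, skipped


# Made-up sample data for illustration only.
sample = [
    ("Paris is lovely", {'entities': [(0, 5, 'GPE')]}),  # fine
    ("Berlin is big", {'entities': [(0, 6, '')]}),       # empty label
    ("Rome is old", {'entities': [(0, None, 'GPE')]}),   # missing end offset
]
valid, skipped = split_valid_rows(sample)
print(len(valid), skipped)
```

This way you get the indices of the skipped rows up front and can fix the annotations, instead of silently dropping them on every pass.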