Question

When training a spaCy NER model on custom training data, I get the following error.

ValueError: [E024] Could not find an optimal move to supervise the parser. Usually, this means the GoldParse was not correct. For example, are all labels added to the model?

Can anyone help me with this?

Answer 1

Passing the training data through the function below before training works without any errors.

import re


def trim_entity_spans(data: list) -> list:
    """Removes leading and trailing white spaces from entity spans.

    Args:
        data (list): The data to be cleaned in spaCy JSON format.

    Returns:
        list: The cleaned data.
    """
    invalid_span_tokens = re.compile(r'\s')

    cleaned_data = []
    for text, annotations in data:
        entities = annotations['entities']
        valid_entities = []
        for start, end, label in entities:
            valid_start = start
            valid_end = end
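            # move the start right past leading whitespace, but never past the span's end
            while valid_start < min(end, len(text)) and invalid_span_tokens.match(
                    text[valid_start]):
                valid_start += 1
            # move the end left past trailing whitespace, but never past the trimmed start
            while valid_end > valid_start and invalid_span_tokens.match(
                    text[valid_end - 1]):
                valid_end -= 1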
            valid_entities.append([valid_start, valid_end, label])
        cleaned_data.append([text, {'entities': valid_entities}])

    return cleaned_data
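
To see which annotations the function above would actually change, a small check like the one below can list the spans that are empty or have whitespace at either edge. This is only a sketch: `find_untrimmed_spans` and `SAMPLE` are illustrative names that are not part of spaCy, and the data is assumed to be in the same (text, {'entities': [...]}) format used above.

```python
def find_untrimmed_spans(data):
    """Return (example_index, start, end, label) for each suspicious span."""
    problems = []
    for i, (text, annotations) in enumerate(data):
        for start, end, label in annotations.get("entities", []):
            span = text[start:end]
            # A span is suspicious if it is empty or if stripping changes it.
            if not span or span != span.strip():
                problems.append((i, start, end, label))
    return problems


# Hypothetical sample data, made up for illustration.
SAMPLE = [
    ("Apple is based in Cupertino", {"entities": [(0, 5, "ORG"), (18, 27, "GPE")]}),
    # The span (10, 17) covers " Google", including the leading space.
    ("She joined Google last year", {"entities": [(10, 17, "ORG")]}),
]

print(find_untrimmed_spans(SAMPLE))  # -> [(1, 10, 17, 'ORG')]
```

If the list comes back empty, the offsets are already clean and E024 probably has a different cause, such as a label that was never added to the model.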

Answer 2

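This happens when there is empty content (data) in your annotations. Examples of empty data include an empty label, or a missing start or end offset for a label. The solution provided above should work for trimming/cleaning the data. However, if you want a brute-force approach, just wrap the model update in an exception handler, as follows: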

import random

import spacy


def train_spacy(data, iterations):
    # Written against the spaCy v2 training API (create_pipe, begin_training).
    nlp = spacy.blank('en')  # create blank Language class
    # create the built-in pipeline components and add them to the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner, last=True)
    else:
        ner = nlp.get_pipe('ner')

    # add every label that occurs in the training data
    for _, annotations in data:
        for ent in annotations.get('entities', []):
            ner.add_label(ent[2])

    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):  # only train NER
        optimizer = nlp.begin_training()
        for itn in range(iterations):
            print("Starting iteration " + str(itn))
            random.shuffle(data)
            losses = {}
            for text, annotations in data:
                try:
                    nlp.update(
                        [text],  # batch of texts
                        [annotations],  # batch of annotations
                        drop=0.2,  # dropout
                        sgd=optimizer,  # callable to update weights
                        losses=losses)
                except Exception as error:
                    # skip examples that cannot be used and keep training
                    print(error)
                    continue
            print(losses)
    return nlp

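So, assuming your TRAIN_DATA contains 1000 rows and only row 200 has empty data, the training loop will skip row 200 on every iteration and train on the rest of the data, instead of the model raising this error:

ValueError: [E024] Could not find an optimal move to supervise the parser.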
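
As an alternative to catching the exception on every iteration, the rows with empty data could be separated out once before training, which also shows exactly which rows are affected. The sketch below is one way to do that under an assumed definition of "empty": an entity that is not a (start, end, label) triple, has a blank label, or has a span that is empty or falls outside the text. `is_valid_entity`, `split_valid_examples` and `SAMPLE` are illustrative names, not spaCy functions.

```python
def is_valid_entity(entity, text):
    """Check one (start, end, label) annotation against its text."""
    if not isinstance(entity, (list, tuple)) or len(entity) != 3:
        return False
    start, end, label = entity
    if not (isinstance(start, int) and isinstance(end, int)):
        return False
    if not (isinstance(label, str) and label.strip()):
        return False
    # The span must be non-empty and lie inside the text.
    return 0 <= start < end <= len(text)


def split_valid_examples(data):
    """Split data into (valid, skipped) lists without modifying it."""
    valid, skipped = [], []
    for text, annotations in data:
        entities = annotations.get("entities") or []
        if all(is_valid_entity(ent, text) for ent in entities):
            valid.append((text, annotations))
        else:
            skipped.append((text, annotations))
    return valid, skipped


# Hypothetical sample data, made up for illustration.
SAMPLE = [
    ("Apple is based in Cupertino", {"entities": [(0, 5, "ORG")]}),
    ("Row with an empty label", {"entities": [(0, 3, "")]}),
    ("Row with an empty span", {"entities": [(4, 4, "ORG")]}),
    ("Row without entities", {"entities": []}),
]

valid, skipped = split_valid_examples(SAMPLE)
print(len(valid), "usable rows;", len(skipped), "skipped")
```

Rows without any entities are kept here, on the assumption that they are meant as negative examples. The `valid` list can then be passed to the training function in place of the original data, and `skipped` can be inspected and fixed by hand.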
