I am using a trained spaCy model to recognize named entities in text. In Python, I have a dataset that looks like this example:
TRAIN_DATA = [
    ('Estado de Mato Grosso do Sul', {
        'entities': [(0, 28, 'LOC')]
    }),
    ('Poder Judiciario', {
        'entities': [(0, 16, 'ORG')]
    }),
    ('Campo Grande', {
        'entities': [(0, 12, 'LOC')]
    }),
    ('Exequente: Fundo de Investimento em Direitos Creditérios Multsegmentos NPL', {
        'entities': [(11, 74, 'MISC')]
    }),
    ('Ipanema VI - Nao Padronizado', {
        'entities': [(0, 10, 'LOC')]
    }),
    ...
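The dataset continues in the same format. Before training, I also run a quick sanity check that each (start, end) offset really covers the substring I meant to annotate (this is just a check, not part of the training script):

    for text, annotations in TRAIN_DATA:
        for start, end, label in annotations['entities']:
            # print the label next to the exact character span it points at
            print(label, repr(text[start:end]))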
After training it like this:
# add labels
for _, annotations in TRAIN_DATA:
    for ent in annotations.get('entities'):
        ner.add_label(ent[2])

# get names of other pipes to disable them during training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
with nlp.disable_pipes(*other_pipes):  # only train NER
    optimizer = nlp.begin_training()
    for itn in range(n_iter):
        random.shuffle(TRAIN_DATA)
        losses = {}
        for text, annotations in tqdm(TRAIN_DATA):
            nlp.update(
                [text],  # batch of texts
                [annotations],  # batch of annotations
                drop=0.5,  # dropout - make it harder to memorise data
                sgd=optimizer,  # callable to update weights
                losses=losses)
        print(losses)

# test the trained model
for text, _ in TRAIN_DATA:
    doc = nlp(text)
    print('Entities', [(ent.text, ent.label_) for ent in doc.ents])
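For completeness: I understand that the pipeline trained above can be saved with nlp.to_disk and loaded back with spacy.load in a later session, which is relevant to my question below. A minimal sketch, where './ner_model' is just a placeholder path:

    import spacy

    # persist the pipeline trained above (placeholder path)
    nlp.to_disk('./ner_model')

    # in a later session, reload it and run it on text
    nlp_loaded = spacy.load('./ner_model')
    doc = nlp_loaded('Campo Grande')
    print([(ent.text, ent.label_) for ent in doc.ents])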
I assume that training and testing are being done on the same dataset. That is not ideal; I would like to split my dataset into a training set and a test set. Alternatively, I could create a separate dataset used only for testing, but then how would I apply the same previously trained model to it? How can I do either of these?
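To make the first option concrete, this is roughly the kind of split I have in mind (a sketch only; the 80/20 ratio is arbitrary, and I am not sure how the held-out part should then be scored):

    import random

    random.shuffle(TRAIN_DATA)
    split = int(len(TRAIN_DATA) * 0.8)   # arbitrary 80/20 split
    train_set = TRAIN_DATA[:split]       # passed to nlp.update(...) as above
    test_set = TRAIN_DATA[split:]        # held out, never shown during training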