spaCy: How do I choose training parameters to avoid overfitting?

Asked: 2019-08-21 17:35:42

Tags: python neural-network nlp spacy

To train a new custom entity type, we can train the model by following the steps described here: https://spacy.io/usage/training#ner

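But I would like to know how to choose the number of iterations, the dropout rate, and the batch size so that the model neither overfits nor underfits.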

Here is an example of the loss output:
Starting training....
Losses:  {'ner': 3875.2103796127717}
Losses:  {'ner': 3091.347521599567}
Losses:  {'ner': 2811.074334355512}
Losses:  {'ner': 2235.2944185569686}
Losses:  {'ner': 2015.7072019365773}
Losses:  {'ner': 1647.0052678292357}
Losses:  {'ner': 1746.1746172501762}
Losses:  {'ner': 1350.2094295662862}
Losses:  {'ner': 1302.3405612718204}
Losses:  {'ner': 1322.3590930188122}
Losses:  {'ner': 1070.3760899125737}
Losses:  {'ner': 990.9221824283309}
Losses:  {'ner': 961.2431416302175}
Losses:  {'ner': 885.3743390914278}
Losses:  {'ner': 838.3100930655886}
Losses:  {'ner': 733.5780730531789}
Losses:  {'ner': 915.0732067395388}
Losses:  {'ner': 734.7598118888878}
Losses:  {'ner': 645.5447305966479}
Losses:  {'ner': 615.6987186405088}
Losses:  {'ner': 624.112212173154}
Losses:  {'ner': 590.4118676242763}
Losses:  {'ner': 411.8125225993247}
Losses:  {'ner': 482.4468110898493}
Losses:  {'ner': 479.08534166022685}
Training completed...

In the output above, the loss generally decreases but sometimes increases again. So at what point should I stop training?

Basically, how do I determine all of the training parameters?

1 Answer:

Answer 0: (score: 2)

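Check out the train CLI, which runs an evaluation on a dev set after each iteration. The training loss alone cannot tell you whether the model is overfitting; the scores on held-out dev data can.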

python -m spacy train en output_dir train.json dev.json -p ner

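There is a built-in early stopping option (-ne) that detects when the model's performance on the dev set has started to degrade and stops training after a given number of iterations without improvement.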
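To make the idea concrete, here is a minimal sketch of patience-based early stopping in plain Python. It is not spaCy's actual implementation; the function name `early_stopping_epoch`, the `patience` parameter, and the dev scores are all hypothetical and only illustrate the logic of "stop after N iterations without improvement on the dev set".

```python
def early_stopping_epoch(dev_scores, patience):
    """Return (stop_epoch, best_epoch) for a sequence of dev-set scores.

    stop_epoch is the index of the iteration at which training would stop,
    or None if the patience limit is never reached.
    """
    best_score = float("-inf")
    best_epoch = None
    epochs_without_improvement = 0
    for epoch, score in enumerate(dev_scores):
        if score > best_score:
            # New best dev score: remember it and reset the counter.
            best_score = score
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return None, best_epoch


# Hypothetical dev-set F-scores, one per iteration (not taken from the question).
dev_f_scores = [61.0, 68.5, 72.1, 71.8, 73.0, 72.6, 72.9, 72.4]
stop_epoch, best_epoch = early_stopping_epoch(dev_f_scores, patience=3)
print(stop_epoch, best_epoch)
```

The model you would keep is the one from `best_epoch`, not the one from the last iteration. Note that a single dip in the score (like the one at index 3) does not trigger a stop; only a sustained lack of improvement does, which is why the occasional increase in the training loss shown in the question is not by itself a reason to stop.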

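Note that the train CLI expects a different data format from the TRAIN_DATA tuples used in the linked example. Here is one way to convert TRAIN_DATA-style data to the CLI training data format for NER: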

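import spacy
import srsly
from spacy.gold import docs_to_json  # spaCy v2.x location of this helper

# Load the pipeline without its NER component; the entities are set manually below.
nlp = spacy.load("en", disable=["ner"])

TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]

docs = []
for text, annot in TRAIN_DATA:
    doc = nlp(text)
    # Convert character offsets to token spans. char_span returns None when
    # the offsets do not align with token boundaries, so skip those spans.
    spans = [doc.char_span(start, end, label=label) for start, end, label in annot["entities"]]
    doc.ents = [span for span in spans if span is not None]
    docs.append(doc)

# Write all docs to a single JSON file that can be passed to the train CLI.
srsly.write_json("ent_train_data.json", [docs_to_json(docs)])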

If your data is in one of the NER formats supported by python -m spacy convert, you can also convert it that way.