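To train a new custom entity, we can train the model using the steps described here: https://spacy.io/usage/training#ner
But how do I choose the number of iterations, the dropout rate, and the batch size so that the model neither overfits nor underfits?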
One example of loss is:
Starting training....
Losses: {'ner': 3875.2103796127717}
Losses: {'ner': 3091.347521599567}
Losses: {'ner': 2811.074334355512}
Losses: {'ner': 2235.2944185569686}
Losses: {'ner': 2015.7072019365773}
Losses: {'ner': 1647.0052678292357}
Losses: {'ner': 1746.1746172501762}
Losses: {'ner': 1350.2094295662862}
Losses: {'ner': 1302.3405612718204}
Losses: {'ner': 1322.3590930188122}
Losses: {'ner': 1070.3760899125737}
Losses: {'ner': 990.9221824283309}
Losses: {'ner': 961.2431416302175}
Losses: {'ner': 885.3743390914278}
Losses: {'ner': 838.3100930655886}
Losses: {'ner': 733.5780730531789}
Losses: {'ner': 915.0732067395388}
Losses: {'ner': 734.7598118888878}
Losses: {'ner': 645.5447305966479}
Losses: {'ner': 615.6987186405088}
Losses: {'ner': 624.112212173154}
Losses: {'ner': 590.4118676242763}
Losses: {'ner': 411.8125225993247}
Losses: {'ner': 482.4468110898493}
Losses: {'ner': 479.08534166022685}
Training completed...
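In the output above, the loss mostly decreases but sometimes increases again. So at what point should I stop training?
Basically, how do I determine all of the training parameters?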
Answer 0 (score: 2)
Check out the command-line train CLI, which runs an evaluation on the dev set after each iteration.
python -m spacy train en output_dir train.json dev.json -p ner
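There is a built-in early stopping option (-ne), which detects when the model's performance on the dev set has stopped improving and ends training after a given number of iterations without improvement.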
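To make the idea behind early stopping concrete, here is a minimal sketch of patience-based stopping in plain Python. It illustrates the general technique, not spaCy's actual implementation; the function name `early_stopping_index` and the dev scores are hypothetical and do not come from the question.

```python
def early_stopping_index(dev_scores, patience):
    """Return the index of the iteration at which training would stop,
    or None if the patience budget is never used up.

    dev_scores: one dev-set score per iteration (higher is better).
    patience:   number of consecutive iterations without a new best score
                that we tolerate before stopping.
    """
    best_score = float("-inf")
    since_best = 0
    for i, score in enumerate(dev_scores):
        if score > best_score:
            # New best score: remember it and reset the counter.
            best_score = score
            since_best = 0
        else:
            # No improvement over the best score seen so far.
            since_best += 1
            if since_best >= patience:
                return i
    return None


# Hypothetical dev-set F-scores, one per iteration (illustration only).
dev_scores = [0.61, 0.70, 0.74, 0.73, 0.75, 0.74, 0.74, 0.73]
stop_at = early_stopping_index(dev_scores, patience=3)
```

Note that the decision is based on scores from held-out dev data rather than on the training loss. A training loss that keeps falling, as in the question, does not by itself say whether the model is overfitting.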
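However, the train CLI expects a different data format. Here is one way to convert data in the TRAIN_DATA format into the CLI training data format for NER: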
import spacy
import srsly
from spacy.gold import docs_to_json  # spaCy v2 module path

# Load the English model with its NER component disabled
nlp = spacy.load('en', disable=["ner"])

TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]

docs = []
for text, annot in TRAIN_DATA:
    doc = nlp(text)
    # char_span returns None if the offsets do not align with token boundaries
    doc.ents = [doc.char_span(start_idx, end_idx, label=label) for start_idx, end_idx, label in annot["entities"]]
    docs.append(doc)

# Write the docs in the JSON format that the train CLI reads
srsly.write_json("ent_train_data.json", [docs_to_json(docs)])
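Because the conversion depends on the character offsets being correct, it can help to sanity-check the annotations first. The following is only a sketch in plain Python: the helper `find_bad_spans` is hypothetical (it is not part of spaCy), and it checks ranges, surrounding whitespace, and overlaps only. It does not check whether the offsets align with spaCy's token boundaries.

```python
def find_bad_spans(train_data):
    """Return (text, span, reason) tuples for entity spans that look wrong."""
    problems = []
    for text, annot in train_data:
        prev_end = 0
        # Sort by start offset so overlaps can be found in one pass.
        for start, end, label in sorted(annot["entities"]):
            span = (start, end, label)
            if not 0 <= start < end <= len(text):
                problems.append((text, span, "out of range"))
                continue
            if text[start:end] != text[start:end].strip():
                problems.append((text, span, "leading/trailing whitespace"))
            if start < prev_end:
                problems.append((text, span, "overlaps previous span"))
            prev_end = max(prev_end, end)
    return problems


TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]
problems = find_bad_spans(TRAIN_DATA)
```

For the two examples above the list of problems is empty. A span such as (6, 13, "LOC") in "I like London" would be reported, because it includes the space before "London".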
If your data is already in one of the NER formats supported by python -m spacy convert, you can convert it with that command instead.