Loss does not decrease in the NER training loop

Time: 2019-05-10 17:20:23

Tags: python deep-learning nlp spacy

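I am trying to train a new entity type, "HE INST", to recognize colleges. That is the only new label. I have a long document as raw text. I ran NER on it, saved the predicted entities to TRAIN_DATA, and then added the new entity label to TRAIN_DATA (wherever a new annotation overlapped an existing entity, I replaced the existing one).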
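The replace-on-overlap step described above could look roughly like the sketch below. It is only an illustration: `overlaps` and `merge_entities` are hypothetical helper names, and the character offsets are made up rather than taken from the real TRAIN_DATA.

```python
def overlaps(a, b):
    """True if two (start, end, label) spans with exclusive ends share characters."""
    return a[0] < b[1] and b[0] < a[1]


def merge_entities(predicted, manual):
    """Keep every manual span; keep a predicted span only if it overlaps no manual span."""
    kept = [p for p in predicted if not any(overlaps(p, m) for m in manual)]
    return sorted(kept + list(manual))


# Hypothetical offsets: the model split one college name into two ORG spans.
predicted = [(12, 35, "ORG"), (36, 52, "ORG"), (61, 64, "CARDINAL")]
manual = [(12, 52, "HE INST")]

merged = merge_entities(predicted, manual)
print(merged)
```

Dropping every predicted span that touches a manual one keeps the annotations free of overlaps, which matters because a character can belong to only one entity.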

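The loss in the training loop stays constant (about 4000 across all 15 texts, and about 300 for a single text). Why does this happen, and how do I train the model properly? I have about 18 texts with 40 annotated new entities. Even after all the iterations, the model still does not predict the output correctly.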

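I have not changed the script much. I only added en_core_web_lg, the new label, and my TRAIN_DATA.

I am trying to tag institutes in résumé (C.V.) data:

This is one of the texts in my TRAIN_DATA (sorry for the long text). I have about 18 texts like this one collected into TRAIN_DATA.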

[("To perform better in my work each day. To increase my knowledge. To bring out my best by hardworking and improving my skills. To serve my parents and my family. To contribute my skills to my country. Marital ; Single Status Nationality \xe2\x80\x94: Indian Known . Parr . English, Malayalam, Hindi, Tamil Languages Hobby Playing cricket and football, Listening to music, Movies, Games. Father's ; V.N. Balappan Nair Name Mother's ; Saraswathy B Nair Name Believers Church Caarmel Engineering College R-Perunad Btech Electronics and communication engineering 6.09(Upto S6) 2015 - 2019 Marthoma Senior Secondary School Kozhencherry All India Senior School Certificate Examination 75% 2014 - 2015 Marthoma Senior Secondary School Kozhencherry Secondary School Examination 8.2 2012 - 2013 s@ INTERESTS Electronics, Sports s@ PERSONAL STRENGTHS Hardworking Loyal Good Team Spirit Good in mathematics ees IAA eM LANL NUL e (2 Problem Solving Skills rg DUS \\ TRAININGS completed the Vocational Industrial Training on Long Distance Communication Systems conducted by Southern Telecom Region, Bharat Sanchar Nigam Limited. Completed the internship training in Power Electronics Group(PEG), Tool Room, Fabrication Shop, Transform Winding, Electro Plating, Security And Surveillance Group(SSG), Special Products Group(SPG), Search And Rescue Beacon(SRB), Intelligent Tracking and Communication Project and Technology Development Center of Keltron Equipment Complex, Thiruvananthapuram. PROJECTS Final Year Project: Life Detection Using Quadcopter This project is useful at the time of natural calamities like flood earthquake etc... And can also be used in military applications as this device detects life signals using a PIR sensor and a thermal sensor. The components used in this are: PIR sensor, Thermal sensor, Arduino Nano, BEC, ESC, Quadcopter. Design project: Wireless Power Bank Wireless Power Bank enables us to charge our phone wordlessly. 
It can charge a device which is kept 10m(maximum) away from the adaptor without any obstacles in between. It uses the IR technology for power transmission. ACHIEVEMENTS & AWARDS Participated in Pecardio Debugging Conducted as a part of NAKSHATRA 2019, The Annual National Level Techno Cultural Fest held at Saingits College of Engineering, kottayam. Volunteered in Alexa One day workshop on Artificial intelligence. Completed a period of two year tenue with a total of 240 hours in the National Service Scheme activities and has attended NSS Annual Special Camp. Participant in Cricket and football at the Annual Sports Meets. DECLARATION do here by confirm that the information given in this form is true to the best of my knowledge and belief.", {'entities': [(29, 37, 'DATE'), (210, 223, 'ORG'), (241, 247, 'NORP'), (256, 260, 'PERSON'), (263, 270, 'LANGUAGE'), (272, 281, 'PERSON'), (283, 288, 'PERSON'), (290, 295, 'NORP'), (362, 375, 'EVENT'), (388, 401, 'PERSON'), (402, 420, 'PERSON'), (423, 445, 'PERSON'), (446, 490, 'HE INST'), (563, 574, 'DATE'), (575, 620, 'ORG'), (625, 668, 'ORG'), (669, 672, 'PERCENT'), (673, 684, 'DATE'), (685, 717, 'ORG'), (764, 775, 'DATE'), (779, 800, 'ORG'), (890, 893, 'ORG'), (909, 910, 'CARDINAL'), (963, 997, 'ORG'), (1001, 1036, 'ORG'), (1050, 1073, 'ORG'), (1075, 1103, 'ORG'), (1142, 1169, 'ORG'), (1172, 1181, 'ORG'), (1183, 1199, 'ORG'), (1201, 1218, 'ORG'), (1220, 1235, 'ORG'), (1275, 1301, 'ORG'), (1304, 1332, 'ORG'), (1335, 1355, 'ORG'), (1360, 1415, 'ORG'), (1419, 1444, 'ORG'), (1446, 1464, 'LOC'), (1475, 1494, 'EVENT'), (1797, 1809, 'GPE'), (1811, 1814, 'GPE'), (1816, 1819, 'ORG'), (1821, 1831, 'ORG'), (1849, 1888, 'ORG'), (1969, 1980, 'CARDINAL'), (2050, 2052, 'ORG'), (2088, 2122, 'ORG'), (2126, 2154, 'ORG'), (2168, 2182, 'EVENT'), (2188, 2194, 'DATE'), (2239, 2270, 'HE INST'), (2297, 2302, 'GPE'), (2303, 2310, 'DATE'), (2358, 2369, 'DATE'), (2370, 2378, 'DATE'), (2401, 2410, 'TIME'), (2414, 2441, 'ORG'), (2470, 2493, 'ORG'), (2534, 
2557, 'EVENT')]})]
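With offsets this long it is easy for a span to drift by a few characters, so it may help to print the text each "HE INST" span actually covers. The sketch below is only an illustration: `spans_for_label` is a hypothetical helper, and the example reuses the test sentence from the script with offsets counted by hand.

```python
def spans_for_label(text, entities, label):
    """Return the substrings covered by every (start, end, label) span with the given label."""
    return [text[start:end] for start, end, ent_label in entities if ent_label == label]


text = "B.Tech from Believers Church Caarmel Engineering College CGPA of 8.9"
# Offsets counted by hand for this sentence (assumption, not from TRAIN_DATA).
entities = [(12, 56, "HE INST"), (65, 68, "CARDINAL")]

he_inst_spans = spans_for_label(text, entities, "HE INST")
print(he_inst_spans)
```

Running the same helper over each (text, annotations) pair in TRAIN_DATA would show whether every "HE INST" span starts and ends on the intended words.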

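The script is below. (Note: the eval call parses TRAIN_DATA into a list after it has been read from a text file as a string. You most likely know this already, but just in case.)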

from __future__ import unicode_literals, print_function

import plac
import random
from pathlib import Path
import spacy
import en_core_web_lg
from spacy.util import minibatch, compounding


# new entity label
LABEL = "HE INST"

with open('train_dump-backup.txt', 'r') as i_file:
    t_data = i_file.read()
TRAIN_DATA=eval(t_data)

@plac.annotations(
    model=("en_core_web_lg", "option", "m", str),
    new_model_name=("NLP_INST", "option", "nm", str),
    output_dir=("/home/drbinu/Downloads/NLP_INST", "option", "o", Path),
    n_iter=("30", "option", "n", int),
)

def main(model=None, new_model_name="animal", output_dir=None, n_iter=30):
    """Set up the pipeline and entity recognizer, and train the new entity."""
    random.seed(0)
    if model is not None:
        nlp = spacy.load(model)  # load existing spaCy model
        print("Loaded model '%s'" % model)
    else:
        nlp = spacy.blank("en")  # create blank Language class
        print("Created blank 'en' model")
    # Add entity recognizer to model if it's not in the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if "ner" not in nlp.pipe_names:
        ner = nlp.create_pipe("ner")
        nlp.add_pipe(ner)
    # otherwise, get it, so we can add labels to it
    else:
        ner = nlp.get_pipe("ner")

    ner.add_label(LABEL)  # add new entity label to entity recognizer
    # Adding extraneous labels shouldn't mess anything up
    ner.add_label("VEGETABLE")
    if model is None:
        optimizer = nlp.begin_training()
    else:
        optimizer = nlp.resume_training()
    move_names = list(ner.move_names)
    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
    with nlp.disable_pipes(*other_pipes):  # only train NER
        sizes = compounding(1.0, 4.0, 1.001)
        # batch up the examples using spaCy's minibatch
        for itn in range(n_iter):
            random.shuffle(TRAIN_DATA)
            batches = minibatch(TRAIN_DATA, size=sizes)
            losses = {}
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(texts, annotations, sgd=optimizer, drop=0.35, losses=losses)
            print("Losses", losses)

    # test the trained model
    test_text = "B.Tech from Believers Church Caarmel Engineering College CGPA of 8.9"
    doc = nlp(test_text)
    print("Entities in '%s'" % test_text)
    for ent in doc.ents:
        print(ent.label_, ent.text)

    # save model to output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.meta["name"] = new_model_name  # rename model
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)

        # test the saved model
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        # Check the classes have loaded back consistently
        assert nlp2.get_pipe("ner").move_names == move_names
        doc2 = nlp2(test_text)
        for ent in doc2.ents:
            print(ent.label_, ent.text)


if __name__ == "__main__":
    plac.call(main)
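As an aside on the loading step: eval executes whatever the file contains. If the dump holds only a plain Python literal (an assumption about the file), `ast.literal_eval` from the standard library can parse it without running code. A minimal sketch, where `load_train_data` and the sample record are hypothetical:

```python
import ast
import os
import tempfile


def load_train_data(path):
    """Parse a file containing a Python literal list of (text, annotations) tuples."""
    with open(path, "r", encoding="utf-8") as handle:
        data = ast.literal_eval(handle.read())
    if not isinstance(data, list):
        raise ValueError("expected a list of (text, annotations) tuples")
    return data


# Round-trip a tiny made-up sample through a temporary file.
sample = [("Studied at Example College", {"entities": [(11, 26, "HE INST")]})]
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(repr(sample))

loaded = load_train_data(tmp.name)
os.remove(tmp.name)
print(loaded == sample)
```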

1 Answer:

Answer 0 (score: 1)

The loss appears to keep increasing because the pipeline component adds to the running loss on every update:

https://github.com/explosion/spaCy/blob/ae4af52ce7dd9dda0eb0f1b8eeb0cba7d20facdf/spacy/pipeline/pipes.pyx#L989

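At the start of each epoch, you may want to take a snapshot of the total accumulated loss; at the end of that epoch, you can then compute the average loss over the data you observed.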
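A minimal sketch of that bookkeeping follows. It does not call spaCy: `fake_update` is a made-up stand-in for `nlp.update` that adds a fixed amount per example to the `losses` dict, and `average_epoch_loss` is a hypothetical helper. Note that the script in the question creates a fresh `losses = {}` on every iteration, so there the snapshot would be empty and the printed value is already the sum for that epoch alone.

```python
def average_epoch_loss(snapshot, losses, n_examples, key="ner"):
    """Average per-example loss accumulated since `snapshot` was taken."""
    if n_examples <= 0:
        raise ValueError("n_examples must be positive")
    return (losses.get(key, 0.0) - snapshot.get(key, 0.0)) / n_examples


def fake_update(batch, losses):
    """Stand-in for nlp.update (assumption): adds 2.0 per example to the running total."""
    losses["ner"] = losses.get("ner", 0.0) + 2.0 * len(batch)


losses = {}  # shared across epochs, so the total keeps growing
averages = []
examples = list(range(18))  # placeholder for 18 training texts

for epoch in range(3):
    snapshot = dict(losses)  # copy of the totals before this epoch
    for start in range(0, len(examples), 4):
        fake_update(examples[start:start + 4], losses)
    averages.append(average_epoch_loss(snapshot, losses, len(examples)))

print(averages)
```

Dividing by the number of examples makes runs with 1 text and 15 texts comparable, which the raw sums (about 300 versus about 4000) are not.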