How do I "update" an existing Named Entity Recognition model, rather than creating one from scratch?

Asked: 2014-02-07 01:17:33

Tags: java nlp opennlp corpus

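Please see the OpenNLP tutorial steps for Named Entity Recognition: Link to tutorial. I am using the "en-ner-person.bin" model found here. The tutorial has instructions for training and creating a new model. Is there any way to "update" the existing "en-ner-person.bin" with additional training data?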

Say I have a list of 500 additional person names that are otherwise not recognized as persons. How do I generate a new model?

2 answers:

Answer 0 (score: 5)

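Sorry it took me a while to put together a decent code example... The code below reads in your sentences and uses the default en-ner-person model to do the best it can. It then writes those results to a file of good hits and a file of bad hits. I then feed those files to the "modelbuilder-addon" in the call at the bottom.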

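To get the best results, run the class as is... then go into the known entities file and the blacklist file, and add and remove names. In other words, put names that it did not find at all, but that you know of, into the knowns file, and remove bad names from the knowns file. Remove good names from the blacklist file and add them to the knowns file. Then run the model builder part again, without the first part that reads in all the data. It is OK to have duplicates in the knowns and blacklist files. Let me know if you have any questions... this is kind of complicated.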
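The manual curation step described above can also be scripted. The following is only a minimal sketch, assuming both files hold one name per line; the class and method names (`EntityListCurator`, `readNames`, `curate`) and the sample names are hypothetical and are not part of OpenNLP or the modelbuilder-addon.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of OpenNLP: edits the knowns and blacklist files in place.
class EntityListCurator {

  // Reads a one-name-per-line file into an ordered set, skipping blank lines.
  static Set<String> readNames(Path file) throws IOException {
    Set<String> names = new LinkedHashSet<>();
    if (Files.exists(file)) {
      for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
        String name = line.trim();
        if (!name.isEmpty()) {
          names.add(name);
        }
      }
    }
    return names;
  }

  // Adds the good names to the knowns, removes them from the blacklist,
  // and drops the bad names from the knowns. Both files are rewritten.
  static void curate(Path knownsFile, Path blacklistFile,
                     List<String> goodNames, List<String> badNames) throws IOException {
    Set<String> knowns = readNames(knownsFile);
    Set<String> blacklist = readNames(blacklistFile);

    knowns.addAll(goodNames);
    blacklist.removeAll(goodNames);
    knowns.removeAll(badNames);

    Files.write(knownsFile, knowns, StandardCharsets.UTF_8);
    Files.write(blacklistFile, blacklist, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) throws IOException {
    // Demo on temporary files so nothing real is overwritten.
    Path dir = Files.createTempDirectory("modelbuilder");
    Path knowns = dir.resolve("knownentities.txt");
    Path blacklist = dir.resolve("blentities.txt");
    Files.write(knowns, Arrays.asList("John Smith", "Monday"), StandardCharsets.UTF_8);
    Files.write(blacklist, Arrays.asList("Jane Doe", "Xyz"), StandardCharsets.UTF_8);

    curate(knowns, blacklist,
        Arrays.asList("Jane Doe", "Arvind Rao"), // names you know are people
        Arrays.asList("Monday"));                // false hits to drop from the knowns

    System.out.println("knowns: " + readNames(knowns));
    System.out.println("blacklist: " + readNames(blacklist));
  }
}
```

The sketch also removes duplicates as a side effect of using a set. That is harmless here, since duplicates in either file are fine anyway.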

import java.io.File;
import java.io.FileWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import opennlp.addons.modelbuilder.DefaultModelBuilderUtil;
import opennlp.tools.entitylinker.EntityLinkerProperties;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.Span;

public class ModelBuilderAddonUse {
  // Fill this method in with however you are going to get your data into a list of sentences.
  // For me, I am hitting a MySQL database; DocProvider is not imported from OpenNLP, so swap in your own data access code.
  private static List<String> getSentencesFromSomewhere() throws Exception {
    List<String> sentences = new ArrayList<>();
    int counter = 0;
    DocProvider dp = new DocProvider();
    String modelPath = "c:\\apache\\entitylinker\\";
    EntityLinkerProperties properties = new EntityLinkerProperties(new File(modelPath + "entitylinker.properties"));
    Map<Long, List<String>> docs = dp.getDocs(properties);
    for (Long key : docs.keySet()) {
      System.out.println("\t\tDOC: " + key + "\n\n");
      sentences.addAll(docs.get(key));
      // count each document once and stop after roughly 1000 of them
      counter++;
      if (counter > 1000) {
        break;
      }
    }
    return sentences;
  }

  public static void main(String[] args) throws Exception {
    /**
     * establish a file to put sentences in
     */
    File sentences = new File("C:\\temp\\modelbuilder\\sentences.text");

    /**
     * establish a file to put your NER hits in (the ones you want to keep based
     * on prob)
     */
    File knownEntities = new File("C:\\temp\\modelbuilder\\knownentities.txt");

    /**
     * establish a BLACKLIST file to put your bad NER hits in (also can be based
     * on prob)
     */
    File blacklistedentities = new File("C:\\temp\\modelbuilder\\blentities.txt");

    /**
     * establish a file to write your annotated sentences to
     */
    File annotatedSentences = new File("C:\\temp\\modelbuilder\\annotatedSentences.txt");

    /**
     * establish a file to write your model to
     */
    File theModel = new File("C:\\temp\\modelbuilder\\theModel");


//------------create a bunch of file writers to write your results and sentences to a file

    FileWriter sentenceWriter = new FileWriter(sentences, true);
    FileWriter blacklistWriter = new FileWriter(blacklistedentities, true);
    FileWriter knownEntityWriter = new FileWriter(knownEntities, true);

//set some thresholds to decide where to write hits, you don't have to use these at all...
    double keeperThresh = .95;
    double blacklistThresh = .7;


    /**
     * Load your model as normal
     */
    TokenNameFinderModel personModel = new TokenNameFinderModel(new File("c:\\temp\\opennlpmodels\\en-ner-person.zip"));
    NameFinderME personFinder = new NameFinderME(personModel);
    /**
     * do your normal NER on the sentences you have
     */
    for (String s : getSentencesFromSomewhere()) {
      sentenceWriter.write(s.trim() + "\n");
      sentenceWriter.flush();

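      String[] tokens = s.split(" "); // better to use a real tokenizer
      Span[] find = personFinder.find(tokens);
      // probs(Span[]) returns one probability per detected span, so it lines up with names[i];
      // the no-argument probs() is per token and would be indexed incorrectly below
      double[] probs = personFinder.probs(find);
      String[] names = Span.spansToStrings(find, tokens);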
      for (int i = 0; i < names.length; i++) {
        //YOU PROBABLY HAVE BETTER HEURISTICS THAN THIS TO MAKE SURE YOU GET GOOD HITS OUT OF THE DEFAULT MODEL
        if (probs[i] > keeperThresh) {
          knownEntityWriter.write(names[i].trim() + "\n");
        }
        if (probs[i] < blacklistThresh) {
          blacklistWriter.write(names[i].trim() + "\n");
        }
      }
      personFinder.clearAdaptiveData();
      blacklistWriter.flush();
      knownEntityWriter.flush();
    }
    //flush and close all the writers
    knownEntityWriter.flush();
    knownEntityWriter.close();
    sentenceWriter.flush();
    sentenceWriter.close();
    blacklistWriter.flush();
    blacklistWriter.close();

    /**
     * THIS IS WHERE THE ADDON IS GOING TO USE THE FILES (AS IS) TO CREATE A NEW MODEL. YOU SHOULD NOT HAVE TO RUN THE FIRST PART AGAIN AFTER THIS RUNS, JUST NOW PLAY WITH THE
     * KNOWN ENTITIES AND BLACKLIST FILES AND RUN THE METHOD BELOW AGAIN UNTIL YOU GET SOME DECENT RESULTS (A DECENT MODEL OUT OF IT).
     */
    DefaultModelBuilderUtil.generateModel(sentences, knownEntities, blacklistedentities,
            theModel, annotatedSentences, "person", 3);


  }
}
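The two thresholds in the class above leave a middle band (between 0.7 and 0.95) in which a hit goes to neither file. To make that routing easy to try out without loading a model, here is a minimal sketch that isolates it. It assumes you already have one probability per detected name (for example span-level values, which as far as I know is what `NameFinderME.probs(Span[])` returns); `HitRouter`, `Routed` and `route` are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of OpenNLP: splits NER hits by probability.
class HitRouter {

  // Holds the two output lists produced by route().
  static final class Routed {
    final List<String> knowns = new ArrayList<>();
    final List<String> blacklist = new ArrayList<>();
  }

  // Sends each name to the knowns list if its probability is above keepThresh,
  // to the blacklist if it is below blacklistThresh, and nowhere otherwise.
  static Routed route(String[] names, double[] spanProbs,
                      double keepThresh, double blacklistThresh) {
    if (names.length != spanProbs.length) {
      throw new IllegalArgumentException("need exactly one probability per name");
    }
    Routed result = new Routed();
    for (int i = 0; i < names.length; i++) {
      String name = names[i].trim();
      if (spanProbs[i] > keepThresh) {
        result.knowns.add(name);
      } else if (spanProbs[i] < blacklistThresh) {
        result.blacklist.add(name);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // Made-up names and probabilities, purely for illustration.
    String[] names = {"John Smith", "Monday", "Acme"};
    double[] probs = {0.99, 0.40, 0.80};
    Routed r = route(names, probs, 0.95, 0.7);
    System.out.println("knowns: " + r.knowns);
    System.out.println("blacklist: " + r.blacklist);
  }
}
```

Hits in the middle band are the ones worth reviewing by hand before the next model-building run.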

This is what the console should look like (I removed some lines for brevity):

ITERATION: 0
    Perfoming Known Entity Annotation
        knowns: 625
        reading data....: 
        writing annotated sentences....: 
        building model.... 
    Building Model using 7343 annotations
        reading training data...
Indexing events using cutoff of 5

    Computing event counts...  done. 561755 events
    Indexing...  done.
Sorting and merging events... done. Reduced 561755 events to 127362.
Done indexing.
Incorporating indexed data for training...  
done.
    Number of Event Tokens: 127362
        Number of Outcomes: 3
      Number of Predicates: 106490
...done.
Computing model parameters ...
Performing 100 iterations.
  1:  ... loglikelihood=-617150.9462211537  0.015709695507828147
  2:  ... loglikelihood=-90520.86903515142  0.9771288195031642
  3:  ... loglikelihood=-56901.86905339755  0.9771288195031642
  4:  ... loglikelihood=-44231.80460317638  0.9773086131854634
  5:  ... loglikelihood=-37222.56576767385  0.9787985865724381
  6:  ... loglikelihood=-32900.5623814595   0.9801924326441243
  7:  ... loglikelihood=-29992.881445391187 0.9829747843810914
  8:  ... loglikelihood=-27893.341149419102 0.9836423351817073
  9:  ... loglikelihood=-26296.107313900917 0.9845092611547739
 10:  ... loglikelihood=-25033.501573153182 0.9850682236918229
 11:  ... loglikelihood=-24006.060636903556 0.9856182855515305
 12:  ... loglikelihood=-23150.856525607975 0.9859084476328649
 13:  ... loglikelihood=-22425.987337392176 0.9861897090368577
 14:  ... loglikelihood=-21802.386362016423 0.9864211266477378
 15:  ... loglikelihood=-21259.20580401235  0.9865208142339632
 16:  ... loglikelihood=-20781.0716762281   0.9867362106256287
 17:  ... loglikelihood=-20356.37732369309  0.986905323495118
 18:  ... loglikelihood=-19976.18228587008  0.9870673158227341
 19:  ... loglikelihood=-19633.47877575036  0.9872097266601988
 20:  ... loglikelihood=-19322.689448146353 0.9873165347882974
 21:  ... loglikelihood=-19039.31522510173  0.9874073216971812
 22:  ... loglikelihood=-18779.683112448918 0.9875176900962164
 23:  ... loglikelihood=-18540.76222439295  0.9876316187661881
 24:  ... loglikelihood=-18320.027315327916 0.9877081645913254
 25:  ... loglikelihood=-18115.35602743375  0.9877918309583359
 26:  ... loglikelihood=-17924.95047403401  0.9878612562416
 27:  ... loglikelihood=-17747.27665623459  0.9879378020667373
 28:  ... loglikelihood=-17581.01712643139  0.9879947664017231
 29:  ... loglikelihood=-17425.03361369085  0.9880784327687337
 30:  ... loglikelihood=-17278.3372262906   0.9881282765618463
 31:  ... loglikelihood=-17140.06447937828  0.9882012621160471
 32:  ... loglikelihood=-17009.45784626013  0.9882546661800963
 33:  ... loglikelihood=-16885.84985637711  0.9883187510569554
 34:  ... loglikelihood=-16768.64999916476  0.9883703749855364
 35:  ... loglikelihood=-16657.3338665414   0.9884166585077124
 36:  ... loglikelihood=-16551.434095577726 0.9884558214880153
 37:  ... loglikelihood=-16450.532769374073 0.9885074454165962
 38:  ... loglikelihood=-16354.255007222264 0.9885448282614306
 39:  ... loglikelihood=-16262.263530858221 0.9885733104289236
 40:  ... loglikelihood=-16174.254036589966 0.9886391754412511
 41:  ... loglikelihood=-16089.951236435176 0.9886765582860856
 42:  ... loglikelihood=-16009.105457548561 0.9887281822146665
 43:  ... loglikelihood=-15931.489709807445 0.988747763704818
 44:  ... loglikelihood=-15856.897147780543 0.9887798061432475
 45:  ... loglikelihood=-15785.138866385483 0.9888065081752722
 46:  ... loglikelihood=-15716.041980029182 0.9888349903427651
 47:  ... loglikelihood=-15649.447943527766 0.9888581321038531
 48:  ... loglikelihood=-15585.211079986258 0.9888901745422827
 49:  ... loglikelihood=-15523.19728647256  0.9889328977935221
 50:  ... loglikelihood=-15463.282892914636 0.9889595998255467
 51:  ... loglikelihood=-15405.353653492159 0.9889685005028883
 52:  ... loglikelihood=-15349.303852923775 0.9889809614511664
 53:  ... loglikelihood=-15295.035512678789 0.9889934223994445
 54:  ... loglikelihood=-15242.457684348112 0.989013003889596
 55:  ... loglikelihood=-15191.485819217298 0.9890236847024059
 56:  ... loglikelihood=-15142.041204645499 0.9890397059216206
 57:  ... loglikelihood=-15094.050459152337 0.9890539470053671
 58:  ... loglikelihood=-15047.445079207273 0.9890592874117721
 59:  ... loglikelihood=-15002.161031666768 0.9890753086309868
 60:  ... loglikelihood=-14958.13838658306  0.9890966702566065
 61:  ... loglikelihood=-14915.320985817205 0.9891180318822262
 62:  ... loglikelihood=-14873.656143433394 0.9891269325595677
 63:  ... loglikelihood=-14833.094374397517 0.9891500743206558
 64:  ... loglikelihood=-14793.589148498404 0.9891589749979973
 65:  ... loglikelihood=-14755.096666806796 0.9891785564881488
 66:  ... loglikelihood=-14717.5756582924   0.9891892373009586
 67:  ... loglikelihood=-14680.98719451864  0.9891892373009586
 68:  ... loglikelihood=-14645.294520562966 0.9891945777073635
 69:  ... loglikelihood=-14610.462900520715 0.9891999181137685
 70:  ... loglikelihood=-14576.45947616036  0.989214159197515
 71:  ... loglikelihood=-14543.25313742511  0.9892212797393881
 72:  ... loglikelihood=-14510.814403643026 0.9892230598748565
 73:  ... loglikelihood=-14479.115314429962 0.9892230598748565
 74:  ... loglikelihood=-14448.129329357815 0.9892426413650078
 75:  ... loglikelihood=-14417.831235594616 0.9892515420423494
 76:  ... loglikelihood=-14388.19706276905  0.9892622228551593
 77:  ... loglikelihood=-14359.204004414    0.9892711235325008
 78:  ... loglikelihood=-14330.8303454032   0.9892764639389058
 79:  ... loglikelihood=-14303.055394843146 0.9892764639389058
 80:  ... loglikelihood=-14275.859423957678 0.9892924851581205
 81:  ... loglikelihood=-14249.223608524193 0.9893013858354621
 82:  ... loglikelihood=-14223.129975482772 0.9893209673256135
 83:  ... loglikelihood=-14197.561353359844 0.9893263077320185
 84:  ... loglikelihood=-14172.50132620183  0.9893280878674867
 85:  ... loglikelihood=-14147.934190713178 0.9893263077320185
 86:  ... loglikelihood=-14123.84491635766  0.9893316481384233
 87:  ... loglikelihood=-14100.21910816809  0.9894313357246487
 88:  ... loglikelihood=-14077.042972066316 0.989433115860117
 89:  ... loglikelihood=-14054.303282478262 0.9894437966729268
 90:  ... loglikelihood=-14031.987352086799 0.9894580377566733
 91:  ... loglikelihood=-14010.083003539214 0.9894615980276099
 92:  ... loglikelihood=-13988.578542971209 0.9894776192468246
 93:  ... loglikelihood=-13967.46273521311  0.9894811795177613
 94:  ... loglikelihood=-13946.724780546094 0.9894829596532296
 95:  ... loglikelihood=-13926.354292898612 0.9894829596532296
 96:  ... loglikelihood=-13906.341279379953 0.9894900801951029
 97:  ... loglikelihood=-13886.676121050288 0.9894936404660395
 98:  ... loglikelihood=-13867.34955484593  0.9894954206015077
 99:  ... loglikelihood=-13848.35265657199  0.9894954206015077
100:  ... loglikelihood=-13829.676824889664 0.9894972007369761
    model generated
        model building complete.... 
        annotated sentences: 7343
    Performing NER with new model
        Printing NER Results. Add undesired results to the blacklist file and start over

//prints some names

    annotated sentences: 7369
        knowns: 651
ITERATION: 1
    Perfoming Known Entity Annotation
        knowns: 651
        reading data....: 
        writing annotated sentences....: 
        building model.... 
    Building Model using 20370 annotations
        reading training data...
Indexing events using cutoff of 5

    Computing event counts...  done. 1116781 events
    Indexing...  done.
Sorting and merging events... done. Reduced 1116781 events to 288251.
Done indexing.
Incorporating indexed data for training...  
done.
    Number of Event Tokens: 288251
        Number of Outcomes: 3
      Number of Predicates: 206399
...done.
Computing model parameters ...
Performing 100 iterations.
  1:  ... loglikelihood=-1226909.3303549637 0.03418485808766446
  2:  ... loglikelihood=-196688.7107544095  0.9622047653031346
  3:  ... loglikelihood=-138615.22912914792 0.9651462551744702
  4:  ... loglikelihood=-114777.09879832959 0.9697075791941303
  5:  ... loglikelihood=-101055.0229949508  0.9716443958126079
  6:  ... loglikelihood=-92253.8923255943   0.973049326591337
  7:  ... loglikelihood=-86146.35307405592  0.9750121107003074
  8:  ... loglikelihood=-81641.85792288609  0.975682788299586
  9:  ... loglikelihood=-78164.62963136223  0.9762594456746667
 10:  ... loglikelihood=-75386.40867917785  0.9767044747358703
 11:  ... loglikelihood=-73106.85371375803  0.9770590652957025
 12:  ... loglikelihood=-71196.60721959372  0.9774718588514668
 13:  ... loglikelihood=-69568.23683712543  0.9777279520335679
 14:  ... loglikelihood=-68160.39924327709  0.9779374828189233
 15:  ... loglikelihood=-66928.70260893498  0.9780914969004666
 16:  ... loglikelihood=-65840.17418566217  0.9782661058882628
 17:  ... loglikelihood=-64869.77222395241  0.9784040022170865
 18:  ... loglikelihood=-63998.109674075415 0.9785159310554173
 19:  ... loglikelihood=-63209.92394252923  0.9786475593692944
 20:  ... loglikelihood=-62493.02131098982  0.9787505339005589
 21:  ... loglikelihood=-61837.53211219312  0.9788597764467698
 22:  ... loglikelihood=-61235.37451190329  0.9789457377946079
 23:  ... loglikelihood=-60679.86146007204  0.9790003590677133
 24:  ... loglikelihood=-60165.407875448924 0.979062143786472
 25:  ... loglikelihood=-59687.30928567587  0.9791346736737104
 26:  ... loglikelihood=-59241.572255584455 0.979201830976709
 27:  ... loglikelihood=-58824.78291785096  0.9792698837104141
 28:  ... loglikelihood=-58434.00392167818  0.979333459290586
 29:  ... loglikelihood=-58066.69284046825  0.979381812548745
 30:  ... loglikelihood=-57720.63696783972  0.9794355383911438
 31:  ... loglikelihood=-57393.9007602091   0.9795089637090889
 32:  ... loglikelihood=-57084.78313293037  0.9795483626601814
 33:  ... loglikelihood=-56791.78250307578  0.9795743301506741
 34:  ... loglikelihood=-56513.567973701254 0.9796298468544863
 35:  ... loglikelihood=-56248.955425711436 0.9796808864047651
 36:  ... loglikelihood=-55996.887560355084 0.9797202853558576
 37:  ... loglikelihood=-55756.41714443519  0.9797543117227102
 38:  ... loglikelihood=-55526.69286884015  0.9797963969659226
 39:  ... loglikelihood=-55306.94735282102  0.9798152010107621
 40:  ... loglikelihood=-55096.48692031122  0.9798563908232679
 41:  ... loglikelihood=-54894.68284780714  0.9799029532200136
 42:  ... loglikelihood=-54700.963840494    0.9799378750175728
 43:  ... loglikelihood=-54514.80953871555  0.9799656333694788
 44:  ... loglikelihood=-54335.744892614406 0.9800005551670381
 45:  ... loglikelihood=-54163.33527156895  0.9800301043803574
 46:  ... loglikelihood=-53997.182198154995 0.9800551764401436
 47:  ... loglikelihood=-53836.91961491415  0.980082039361343
 48:  ... loglikelihood=-53682.210607423985 0.980112484005369
 49:  ... loglikelihood=-53532.74451955152  0.980140242357275
 50:  ... loglikelihood=-53388.23440690913  0.9801688961398878
 51:  ... loglikelihood=-53248.41478285541  0.9801921773382606
 52:  ... loglikelihood=-53113.03961847529  0.9802109813831001
 53:  ... loglikelihood=-52981.880563479055 0.9802351580121796
 54:  ... loglikelihood=-52854.7253600851   0.9802584392105524
 55:  ... loglikelihood=-52731.37642565477  0.9802727661018589
 56:  ... loglikelihood=-52611.64958353087  0.9803005244537649
 57:  ... loglikelihood=-52495.37292415569  0.9803148513450712
 58:  ... loglikelihood=-52382.38578113555  0.9803470868505105
 59:  ... loglikelihood=-52272.53780883427  0.9803748452024166
 60:  ... loglikelihood=-52165.68814994865  0.9803891720937229
 61:  ... loglikelihood=-52061.7046829472   0.9804043944157359
 62:  ... loglikelihood=-51960.46334051503  0.9804151395842157
 63:  ... loglikelihood=-51861.84749132724  0.9804393162132952
 64:  ... loglikelihood=-51765.74737831825  0.9804491659510683
 65:  ... loglikelihood=-51672.05960757943  0.9804634928423747
 66:  ... loglikelihood=-51580.686682513515 0.9804876694714542
 67:  ... loglikelihood=-51491.53657871175  0.9805046826548804
 68:  ... loglikelihood=-51404.52235540815  0.9805172186847735
 69:  ... loglikelihood=-51319.56179989248  0.9805315455760798
 70:  ... loglikelihood=-51236.577101627925 0.9805440816059728
 71:  ... loglikelihood=-51155.494553260556 0.9805584084972793
 72:  ... loglikelihood=-51076.24427590388  0.980569153665759
 73:  ... loglikelihood=-50998.75996642977  0.9805825851263587
 74:  ... loglikelihood=-50922.97866477339  0.9805951211562518
 75:  ... loglikelihood=-50848.84053937224  0.9806112389089714
 76:  ... loglikelihood=-50776.28868909037  0.9806264612309844
 77:  ... loglikelihood=-50705.2689602481   0.9806389972608774
 78:  ... loglikelihood=-50635.729777298875 0.9806470561372372
 79:  ... loglikelihood=-50567.62198610024  0.9806658601820769
 80:  ... loglikelihood=-50500.8987085974   0.9806685464741968
 81:  ... loglikelihood=-50435.51520800019  0.9806775007812633
 82:  ... loglikelihood=-50371.42876358994  0.9806837687962098
 83:  ... loglikelihood=-50308.59855431275  0.9806918276725697
 84:  ... loglikelihood=-50246.98555046764  0.9806989911182228
 85:  ... loglikelihood=-50186.55241287111  0.980703468271756
 86:  ... loglikelihood=-50127.26339882067  0.9807195860244757
 87:  ... loglikelihood=-50069.08427441567  0.9807312266236621
 88:  ... loglikelihood=-50011.9822326526   0.9807357037771953
 89:  ... loglikelihood=-49955.92581691934  0.9807446580842618
 90:  ... loglikelihood=-49900.88484943885  0.9807527169606216
 91:  ... loglikelihood=-49846.83036430355  0.9807634621291014
 92:  ... loglikelihood=-49793.734544757914 0.9807724164361679
 93:  ... loglikelihood=-49741.57066440427  0.9807786844511144
 94:  ... loglikelihood=-49690.31303207665  0.9807840570353543
 95:  ... loglikelihood=-49639.93694007888  0.9807948022038341
 96:  ... loglikelihood=-49590.418615580194 0.9808001747880739
 97:  ... loglikelihood=-49541.73517492774  0.9808073382337271
 98:  ... loglikelihood=-49493.86458067577  0.9808145016793803
 99:  ... loglikelihood=-49446.785601155134 0.9808234559864467
100:  ... loglikelihood=-49400.477772387036 0.9808359920163399
    model generated
        model building complete.... 
        annotated sentences: 20370
    Performing NER with new model


It will do this for each iteration until you see:
......
 97:  ... loglikelihood=-49140.50129715517  0.9808462362240823
 98:  ... loglikelihood=-49095.42289306763  0.9808641444693966
 99:  ... loglikelihood=-49051.095083380205 0.9808713077675223
100:  ... loglikelihood=-49007.49834809576  0.9808748894165852
    model generated

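You can change the number of iterations if you see the annotated sentences stop changing; as you refine your lists, the knowns will also stop changing on subsequent runs.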

HTH

Answer 1 (score: 2)

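Unfortunately, there is no way to append to a model. But you can use the model to find what it can find, write the hits it finds to a "known entities" file, and write the sentences out to a file. Then you can add other names that you know were not recognized to the "known entities" file (and add more sentences that might contain them to the sentences file). Then you can use an OpenNLP addon called the modelbuilder-addon to build a new model from the sentences file and the "known entities" file.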

See this post for a code example:

OpenNLP: foreign names does not get recognized

This is a very new addon, so let me know how it works for you.
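To make the idea of building training data from a sentences file and a "known entities" file more concrete, here is a rough sketch of how known names could be marked up in the `<START:person> ... <END>` format that the OpenNLP name finder trainer reads. This is not the addon's actual implementation; `KnownEntityAnnotator`, `annotate` and `longestMatch` are hypothetical names, and the sketch assumes whitespace-separated tokens with exact, case-sensitive matching.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not the modelbuilder-addon's code.
class KnownEntityAnnotator {

  // Wraps every occurrence of a known name in <START:type> ... <END> tags.
  // Matching works on whitespace-separated tokens and prefers the longest name.
  static String annotate(String sentence, List<String> knownNames, String type) {
    String[] tokens = sentence.trim().split("\\s+");
    StringBuilder out = new StringBuilder();
    int i = 0;
    while (i < tokens.length) {
      int matchLen = longestMatch(tokens, i, knownNames);
      if (out.length() > 0) {
        out.append(' ');
      }
      if (matchLen > 0) {
        out.append("<START:").append(type).append("> ");
        out.append(String.join(" ", Arrays.copyOfRange(tokens, i, i + matchLen)));
        out.append(" <END>");
        i += matchLen;
      } else {
        out.append(tokens[i]);
        i++;
      }
    }
    return out.toString();
  }

  // Returns the token length of the longest known name starting at 'start', or 0 if none matches.
  static int longestMatch(String[] tokens, int start, List<String> knownNames) {
    int best = 0;
    for (String name : knownNames) {
      if (name.trim().isEmpty()) {
        continue;
      }
      String[] nameTokens = name.trim().split("\\s+");
      if (nameTokens.length <= best || start + nameTokens.length > tokens.length) {
        continue;
      }
      boolean same = true;
      for (int k = 0; k < nameTokens.length; k++) {
        if (!tokens[start + k].equals(nameTokens[k])) {
          same = false;
          break;
        }
      }
      if (same) {
        best = nameTokens.length;
      }
    }
    return best;
  }

  public static void main(String[] args) {
    List<String> knowns = Arrays.asList("Pierre Vinken", "Vinken");
    System.out.println(annotate("Mr. Vinken met Pierre Vinken yesterday .", knowns, "person"));
  }
}
```

A real pipeline would tokenize properly and apply the blacklist as well; the point here is only to show what an annotated training sentence looks like.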