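Please refer to the OpenNLP tutorial step on Named Entity Recognition: Link to tutorial. I am using the "en-ner-person.bin" model found here. The tutorial has instructions on training and creating a new model. Is there any way to "update" the existing "en-ner-person.bin" with additional training data?
Say I have a list of 500 additional person names that are otherwise not recognized as persons. How do I generate a new model?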
Answer 0 (score: 5)
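Sorry it took me a while to put together a decent code example... The code below reads in your sentences and uses the default en-ner-person model to do the best it can. It then writes those results to a file of good hits and a file of bad hits. I then feed those files into the "modelbuilder-addon" call at the bottom.
For best results, run the class as is... then go into the known entities file and the blacklist file and add and remove names. In other words, add the names it did not find at all, but that you know are names, to the knowns file, and remove bad names from the knowns file. Remove good names from the blacklist file and add them to the knowns file. Then run the model builder part again, without the first part that reads in all the data. It is OK to have duplicates in the knowns and blacklist files. Let me know if you have any questions... this is a bit complicated.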
import java.io.File;
import java.io.FileWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import opennlp.addons.modelbuilder.DefaultModelBuilderUtil;
import opennlp.tools.entitylinker.EntityLinkerProperties;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.Span;
public class ModelBuilderAddonUse {
    // Fill this method in with however you get your data into a list of sentences; I am hitting a MySQL database.
    // DocProvider is not shown here; it stands for whatever class loads your documents.
    private static List<String> getSentencesFromSomewhere() throws Exception {
        List<String> sentences = new ArrayList<>();
        int counter = 0;
        DocProvider dp = new DocProvider();
        String modelPath = "c:\\apache\\entitylinker\\";
        EntityLinkerProperties properties = new EntityLinkerProperties(new File(modelPath + "entitylinker.properties"));
        Map<Long, List<String>> docs = dp.getDocs(properties);
        for (Long key : docs.keySet()) {
            System.out.println("\t\tDOC: " + key + "\n\n");
            sentences.addAll(docs.get(key));
            // count each document once and stop once more than 1000 have been read
            counter++;
            if (counter > 1000) {
                break;
            }
        }
        return sentences;
    }
    public static void main(String[] args) throws Exception {
        /**
         * establish a file to put sentences in
         */
        File sentences = new File("C:\\temp\\modelbuilder\\sentences.text");
        /**
         * establish a file to put your NER hits in (the ones you want to keep based
         * on prob)
         */
        File knownEntities = new File("C:\\temp\\modelbuilder\\knownentities.txt");
        /**
         * establish a BLACKLIST file to put your bad NER hits in (also can be based
         * on prob)
         */
        File blacklistedentities = new File("C:\\temp\\modelbuilder\\blentities.txt");
        /**
         * establish a file to write your annotated sentences to
         */
        File annotatedSentences = new File("C:\\temp\\modelbuilder\\annotatedSentences.txt");
        /**
         * establish a file to write your model to
         */
        File theModel = new File("C:\\temp\\modelbuilder\\theModel");
        //------------create a bunch of file writers to write your results and sentences to a file
        FileWriter sentenceWriter = new FileWriter(sentences, true);
        FileWriter blacklistWriter = new FileWriter(blacklistedentities, true);
        FileWriter knownEntityWriter = new FileWriter(knownEntities, true);
        //set some thresholds to decide where to write hits, you don't have to use these at all...
        double keeperThresh = .95;
        double blacklistThresh = .7;
        /**
         * Load your model as normal
         */
        TokenNameFinderModel personModel = new TokenNameFinderModel(new File("c:\\temp\\opennlpmodels\\en-ner-person.zip"));
        NameFinderME personFinder = new NameFinderME(personModel);
        /**
         * do your normal NER on the sentences you have
         */
        for (String s : getSentencesFromSomewhere()) {
            sentenceWriter.write(s.trim() + "\n");
            sentenceWriter.flush();
            String[] tokens = s.split(" ");//better to use a tokenizer really
            Span[] find = personFinder.find(tokens);
            double[] probs = personFinder.probs();
            String[] names = Span.spansToStrings(find, tokens);
            for (int i = 0; i < names.length; i++) {
                //YOU PROBABLY HAVE BETTER HEURISTICS THAN THIS TO MAKE SURE YOU GET GOOD HITS OUT OF THE DEFAULT MODEL
                if (probs[i] > keeperThresh) {
                    knownEntityWriter.write(names[i].trim() + "\n");
                }
                if (probs[i] < blacklistThresh) {
                    blacklistWriter.write(names[i].trim() + "\n");
                }
            }
            personFinder.clearAdaptiveData();
            blacklistWriter.flush();
            knownEntityWriter.flush();
        }
        //flush and close all the writers
        knownEntityWriter.flush();
        knownEntityWriter.close();
        sentenceWriter.flush();
        sentenceWriter.close();
        blacklistWriter.flush();
        blacklistWriter.close();
        /**
         * THIS IS WHERE THE ADDON IS GOING TO USE THE FILES (AS IS) TO CREATE A NEW MODEL. YOU SHOULD NOT HAVE TO RUN THE FIRST PART AGAIN AFTER THIS RUNS, JUST NOW PLAY WITH THE
         * KNOWN ENTITIES AND BLACKLIST FILES AND RUN THE METHOD BELOW AGAIN UNTIL YOU GET SOME DECENT RESULTS (A DECENT MODEL OUT OF IT).
         */
        DefaultModelBuilderUtil.generateModel(sentences, knownEntities, blacklistedentities,
                theModel, annotatedSentences, "person", 3);
    }
}
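The hand-editing step described before the listing (adding missed names to the knowns file and taking good names out of the blacklist file) can also be scripted. What follows is only a minimal sketch in plain Java file I/O and is not part of OpenNLP or the modelbuilder-addon. The class name EntityFileCurator, its method names, the UTF-8 encoding, and the idea of keeping the extra names in a third one-name-per-line file are all assumptions made for illustration. Matching is exact and case-sensitive.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of OpenNLP: curates the knowns and blacklist files.
public class EntityFileCurator {

    /** Returns the trimmed, non-empty names of both lists, de-duplicated, in first-seen order. */
    public static List<String> mergeNames(List<String> existing, List<String> extra) {
        Set<String> merged = new LinkedHashSet<>();
        for (String name : existing) {
            if (!name.trim().isEmpty()) {
                merged.add(name.trim());
            }
        }
        for (String name : extra) {
            if (!name.trim().isEmpty()) {
                merged.add(name.trim());
            }
        }
        return new ArrayList<>(merged);
    }

    /** Returns the blacklist entries (trimmed, de-duplicated) that are not in the knowns list. */
    public static List<String> removeKnown(List<String> blacklist, List<String> knowns) {
        Set<String> knownSet = new HashSet<>();
        for (String known : knowns) {
            knownSet.add(known.trim());
        }
        Set<String> kept = new LinkedHashSet<>();
        for (String entry : blacklist) {
            String trimmed = entry.trim();
            if (!trimmed.isEmpty() && !knownSet.contains(trimmed)) {
                kept.add(trimmed);
            }
        }
        return new ArrayList<>(kept);
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 3) {
            System.out.println("usage: EntityFileCurator <knownsFile> <blacklistFile> <extraNamesFile>");
            return;
        }
        Path knownsFile = Paths.get(args[0]);
        Path blacklistFile = Paths.get(args[1]);
        Path extraNamesFile = Paths.get(args[2]); // assumed: one name per line

        // UTF-8 is an assumption; use whatever encoding your files were written in.
        List<String> knowns = mergeNames(
                Files.readAllLines(knownsFile, StandardCharsets.UTF_8),
                Files.readAllLines(extraNamesFile, StandardCharsets.UTF_8));
        List<String> blacklist = removeKnown(
                Files.readAllLines(blacklistFile, StandardCharsets.UTF_8), knowns);

        // Overwrite both files with the curated lists.
        Files.write(knownsFile, knowns, StandardCharsets.UTF_8);
        Files.write(blacklistFile, blacklist, StandardCharsets.UTF_8);
    }
}
```

After running it against the knowns file, the blacklist file, and your own list of extra names, only the generateModel call needs to be run again.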
This is what the console should look like (I removed some lines for brevity):
ITERATION: 0
Perfoming Known Entity Annotation
knowns: 625
reading data....:
writing annotated sentences....:
building model....
Building Model using 7343 annotations
reading training data...
Indexing events using cutoff of 5
Computing event counts... done. 561755 events
Indexing... done.
Sorting and merging events... done. Reduced 561755 events to 127362.
Done indexing.
Incorporating indexed data for training...
done.
Number of Event Tokens: 127362
Number of Outcomes: 3
Number of Predicates: 106490
...done.
Computing model parameters ...
Performing 100 iterations.
1: ... loglikelihood=-617150.9462211537 0.015709695507828147
2: ... loglikelihood=-90520.86903515142 0.9771288195031642
3: ... loglikelihood=-56901.86905339755 0.9771288195031642
4: ... loglikelihood=-44231.80460317638 0.9773086131854634
5: ... loglikelihood=-37222.56576767385 0.9787985865724381
6: ... loglikelihood=-32900.5623814595 0.9801924326441243
7: ... loglikelihood=-29992.881445391187 0.9829747843810914
8: ... loglikelihood=-27893.341149419102 0.9836423351817073
9: ... loglikelihood=-26296.107313900917 0.9845092611547739
10: ... loglikelihood=-25033.501573153182 0.9850682236918229
11: ... loglikelihood=-24006.060636903556 0.9856182855515305
12: ... loglikelihood=-23150.856525607975 0.9859084476328649
13: ... loglikelihood=-22425.987337392176 0.9861897090368577
14: ... loglikelihood=-21802.386362016423 0.9864211266477378
15: ... loglikelihood=-21259.20580401235 0.9865208142339632
16: ... loglikelihood=-20781.0716762281 0.9867362106256287
17: ... loglikelihood=-20356.37732369309 0.986905323495118
18: ... loglikelihood=-19976.18228587008 0.9870673158227341
19: ... loglikelihood=-19633.47877575036 0.9872097266601988
20: ... loglikelihood=-19322.689448146353 0.9873165347882974
21: ... loglikelihood=-19039.31522510173 0.9874073216971812
22: ... loglikelihood=-18779.683112448918 0.9875176900962164
23: ... loglikelihood=-18540.76222439295 0.9876316187661881
24: ... loglikelihood=-18320.027315327916 0.9877081645913254
25: ... loglikelihood=-18115.35602743375 0.9877918309583359
26: ... loglikelihood=-17924.95047403401 0.9878612562416
27: ... loglikelihood=-17747.27665623459 0.9879378020667373
28: ... loglikelihood=-17581.01712643139 0.9879947664017231
29: ... loglikelihood=-17425.03361369085 0.9880784327687337
30: ... loglikelihood=-17278.3372262906 0.9881282765618463
31: ... loglikelihood=-17140.06447937828 0.9882012621160471
32: ... loglikelihood=-17009.45784626013 0.9882546661800963
33: ... loglikelihood=-16885.84985637711 0.9883187510569554
34: ... loglikelihood=-16768.64999916476 0.9883703749855364
35: ... loglikelihood=-16657.3338665414 0.9884166585077124
36: ... loglikelihood=-16551.434095577726 0.9884558214880153
37: ... loglikelihood=-16450.532769374073 0.9885074454165962
38: ... loglikelihood=-16354.255007222264 0.9885448282614306
39: ... loglikelihood=-16262.263530858221 0.9885733104289236
40: ... loglikelihood=-16174.254036589966 0.9886391754412511
41: ... loglikelihood=-16089.951236435176 0.9886765582860856
42: ... loglikelihood=-16009.105457548561 0.9887281822146665
43: ... loglikelihood=-15931.489709807445 0.988747763704818
44: ... loglikelihood=-15856.897147780543 0.9887798061432475
45: ... loglikelihood=-15785.138866385483 0.9888065081752722
46: ... loglikelihood=-15716.041980029182 0.9888349903427651
47: ... loglikelihood=-15649.447943527766 0.9888581321038531
48: ... loglikelihood=-15585.211079986258 0.9888901745422827
49: ... loglikelihood=-15523.19728647256 0.9889328977935221
50: ... loglikelihood=-15463.282892914636 0.9889595998255467
51: ... loglikelihood=-15405.353653492159 0.9889685005028883
52: ... loglikelihood=-15349.303852923775 0.9889809614511664
53: ... loglikelihood=-15295.035512678789 0.9889934223994445
54: ... loglikelihood=-15242.457684348112 0.989013003889596
55: ... loglikelihood=-15191.485819217298 0.9890236847024059
56: ... loglikelihood=-15142.041204645499 0.9890397059216206
57: ... loglikelihood=-15094.050459152337 0.9890539470053671
58: ... loglikelihood=-15047.445079207273 0.9890592874117721
59: ... loglikelihood=-15002.161031666768 0.9890753086309868
60: ... loglikelihood=-14958.13838658306 0.9890966702566065
61: ... loglikelihood=-14915.320985817205 0.9891180318822262
62: ... loglikelihood=-14873.656143433394 0.9891269325595677
63: ... loglikelihood=-14833.094374397517 0.9891500743206558
64: ... loglikelihood=-14793.589148498404 0.9891589749979973
65: ... loglikelihood=-14755.096666806796 0.9891785564881488
66: ... loglikelihood=-14717.5756582924 0.9891892373009586
67: ... loglikelihood=-14680.98719451864 0.9891892373009586
68: ... loglikelihood=-14645.294520562966 0.9891945777073635
69: ... loglikelihood=-14610.462900520715 0.9891999181137685
70: ... loglikelihood=-14576.45947616036 0.989214159197515
71: ... loglikelihood=-14543.25313742511 0.9892212797393881
72: ... loglikelihood=-14510.814403643026 0.9892230598748565
73: ... loglikelihood=-14479.115314429962 0.9892230598748565
74: ... loglikelihood=-14448.129329357815 0.9892426413650078
75: ... loglikelihood=-14417.831235594616 0.9892515420423494
76: ... loglikelihood=-14388.19706276905 0.9892622228551593
77: ... loglikelihood=-14359.204004414 0.9892711235325008
78: ... loglikelihood=-14330.8303454032 0.9892764639389058
79: ... loglikelihood=-14303.055394843146 0.9892764639389058
80: ... loglikelihood=-14275.859423957678 0.9892924851581205
81: ... loglikelihood=-14249.223608524193 0.9893013858354621
82: ... loglikelihood=-14223.129975482772 0.9893209673256135
83: ... loglikelihood=-14197.561353359844 0.9893263077320185
84: ... loglikelihood=-14172.50132620183 0.9893280878674867
85: ... loglikelihood=-14147.934190713178 0.9893263077320185
86: ... loglikelihood=-14123.84491635766 0.9893316481384233
87: ... loglikelihood=-14100.21910816809 0.9894313357246487
88: ... loglikelihood=-14077.042972066316 0.989433115860117
89: ... loglikelihood=-14054.303282478262 0.9894437966729268
90: ... loglikelihood=-14031.987352086799 0.9894580377566733
91: ... loglikelihood=-14010.083003539214 0.9894615980276099
92: ... loglikelihood=-13988.578542971209 0.9894776192468246
93: ... loglikelihood=-13967.46273521311 0.9894811795177613
94: ... loglikelihood=-13946.724780546094 0.9894829596532296
95: ... loglikelihood=-13926.354292898612 0.9894829596532296
96: ... loglikelihood=-13906.341279379953 0.9894900801951029
97: ... loglikelihood=-13886.676121050288 0.9894936404660395
98: ... loglikelihood=-13867.34955484593 0.9894954206015077
99: ... loglikelihood=-13848.35265657199 0.9894954206015077
100: ... loglikelihood=-13829.676824889664 0.9894972007369761
model generated
model building complete....
annotated sentences: 7343
Performing NER with new model
Printing NER Results. Add undesired results to the blacklist file and start over
//prints some names
annotated sentences: 7369
knowns: 651
ITERATION: 1
Perfoming Known Entity Annotation
knowns: 651
reading data....:
writing annotated sentences....:
building model....
Building Model using 20370 annotations
reading training data...
Indexing events using cutoff of 5
Computing event counts... done. 1116781 events
Indexing... done.
Sorting and merging events... done. Reduced 1116781 events to 288251.
Done indexing.
Incorporating indexed data for training...
done.
Number of Event Tokens: 288251
Number of Outcomes: 3
Number of Predicates: 206399
...done.
Computing model parameters ...
Performing 100 iterations.
1: ... loglikelihood=-1226909.3303549637 0.03418485808766446
2: ... loglikelihood=-196688.7107544095 0.9622047653031346
3: ... loglikelihood=-138615.22912914792 0.9651462551744702
4: ... loglikelihood=-114777.09879832959 0.9697075791941303
5: ... loglikelihood=-101055.0229949508 0.9716443958126079
6: ... loglikelihood=-92253.8923255943 0.973049326591337
7: ... loglikelihood=-86146.35307405592 0.9750121107003074
8: ... loglikelihood=-81641.85792288609 0.975682788299586
9: ... loglikelihood=-78164.62963136223 0.9762594456746667
10: ... loglikelihood=-75386.40867917785 0.9767044747358703
11: ... loglikelihood=-73106.85371375803 0.9770590652957025
12: ... loglikelihood=-71196.60721959372 0.9774718588514668
13: ... loglikelihood=-69568.23683712543 0.9777279520335679
14: ... loglikelihood=-68160.39924327709 0.9779374828189233
15: ... loglikelihood=-66928.70260893498 0.9780914969004666
16: ... loglikelihood=-65840.17418566217 0.9782661058882628
17: ... loglikelihood=-64869.77222395241 0.9784040022170865
18: ... loglikelihood=-63998.109674075415 0.9785159310554173
19: ... loglikelihood=-63209.92394252923 0.9786475593692944
20: ... loglikelihood=-62493.02131098982 0.9787505339005589
21: ... loglikelihood=-61837.53211219312 0.9788597764467698
22: ... loglikelihood=-61235.37451190329 0.9789457377946079
23: ... loglikelihood=-60679.86146007204 0.9790003590677133
24: ... loglikelihood=-60165.407875448924 0.979062143786472
25: ... loglikelihood=-59687.30928567587 0.9791346736737104
26: ... loglikelihood=-59241.572255584455 0.979201830976709
27: ... loglikelihood=-58824.78291785096 0.9792698837104141
28: ... loglikelihood=-58434.00392167818 0.979333459290586
29: ... loglikelihood=-58066.69284046825 0.979381812548745
30: ... loglikelihood=-57720.63696783972 0.9794355383911438
31: ... loglikelihood=-57393.9007602091 0.9795089637090889
32: ... loglikelihood=-57084.78313293037 0.9795483626601814
33: ... loglikelihood=-56791.78250307578 0.9795743301506741
34: ... loglikelihood=-56513.567973701254 0.9796298468544863
35: ... loglikelihood=-56248.955425711436 0.9796808864047651
36: ... loglikelihood=-55996.887560355084 0.9797202853558576
37: ... loglikelihood=-55756.41714443519 0.9797543117227102
38: ... loglikelihood=-55526.69286884015 0.9797963969659226
39: ... loglikelihood=-55306.94735282102 0.9798152010107621
40: ... loglikelihood=-55096.48692031122 0.9798563908232679
41: ... loglikelihood=-54894.68284780714 0.9799029532200136
42: ... loglikelihood=-54700.963840494 0.9799378750175728
43: ... loglikelihood=-54514.80953871555 0.9799656333694788
44: ... loglikelihood=-54335.744892614406 0.9800005551670381
45: ... loglikelihood=-54163.33527156895 0.9800301043803574
46: ... loglikelihood=-53997.182198154995 0.9800551764401436
47: ... loglikelihood=-53836.91961491415 0.980082039361343
48: ... loglikelihood=-53682.210607423985 0.980112484005369
49: ... loglikelihood=-53532.74451955152 0.980140242357275
50: ... loglikelihood=-53388.23440690913 0.9801688961398878
51: ... loglikelihood=-53248.41478285541 0.9801921773382606
52: ... loglikelihood=-53113.03961847529 0.9802109813831001
53: ... loglikelihood=-52981.880563479055 0.9802351580121796
54: ... loglikelihood=-52854.7253600851 0.9802584392105524
55: ... loglikelihood=-52731.37642565477 0.9802727661018589
56: ... loglikelihood=-52611.64958353087 0.9803005244537649
57: ... loglikelihood=-52495.37292415569 0.9803148513450712
58: ... loglikelihood=-52382.38578113555 0.9803470868505105
59: ... loglikelihood=-52272.53780883427 0.9803748452024166
60: ... loglikelihood=-52165.68814994865 0.9803891720937229
61: ... loglikelihood=-52061.7046829472 0.9804043944157359
62: ... loglikelihood=-51960.46334051503 0.9804151395842157
63: ... loglikelihood=-51861.84749132724 0.9804393162132952
64: ... loglikelihood=-51765.74737831825 0.9804491659510683
65: ... loglikelihood=-51672.05960757943 0.9804634928423747
66: ... loglikelihood=-51580.686682513515 0.9804876694714542
67: ... loglikelihood=-51491.53657871175 0.9805046826548804
68: ... loglikelihood=-51404.52235540815 0.9805172186847735
69: ... loglikelihood=-51319.56179989248 0.9805315455760798
70: ... loglikelihood=-51236.577101627925 0.9805440816059728
71: ... loglikelihood=-51155.494553260556 0.9805584084972793
72: ... loglikelihood=-51076.24427590388 0.980569153665759
73: ... loglikelihood=-50998.75996642977 0.9805825851263587
74: ... loglikelihood=-50922.97866477339 0.9805951211562518
75: ... loglikelihood=-50848.84053937224 0.9806112389089714
76: ... loglikelihood=-50776.28868909037 0.9806264612309844
77: ... loglikelihood=-50705.2689602481 0.9806389972608774
78: ... loglikelihood=-50635.729777298875 0.9806470561372372
79: ... loglikelihood=-50567.62198610024 0.9806658601820769
80: ... loglikelihood=-50500.8987085974 0.9806685464741968
81: ... loglikelihood=-50435.51520800019 0.9806775007812633
82: ... loglikelihood=-50371.42876358994 0.9806837687962098
83: ... loglikelihood=-50308.59855431275 0.9806918276725697
84: ... loglikelihood=-50246.98555046764 0.9806989911182228
85: ... loglikelihood=-50186.55241287111 0.980703468271756
86: ... loglikelihood=-50127.26339882067 0.9807195860244757
87: ... loglikelihood=-50069.08427441567 0.9807312266236621
88: ... loglikelihood=-50011.9822326526 0.9807357037771953
89: ... loglikelihood=-49955.92581691934 0.9807446580842618
90: ... loglikelihood=-49900.88484943885 0.9807527169606216
91: ... loglikelihood=-49846.83036430355 0.9807634621291014
92: ... loglikelihood=-49793.734544757914 0.9807724164361679
93: ... loglikelihood=-49741.57066440427 0.9807786844511144
94: ... loglikelihood=-49690.31303207665 0.9807840570353543
95: ... loglikelihood=-49639.93694007888 0.9807948022038341
96: ... loglikelihood=-49590.418615580194 0.9808001747880739
97: ... loglikelihood=-49541.73517492774 0.9808073382337271
98: ... loglikelihood=-49493.86458067577 0.9808145016793803
99: ... loglikelihood=-49446.785601155134 0.9808234559864467
100: ... loglikelihood=-49400.477772387036 0.9808359920163399
model generated
model building complete....
annotated sentences: 20370
Performing NER with new model
It will do this for each iteration until you see
......
97: ... loglikelihood=-49140.50129715517 0.9808462362240823
98: ... loglikelihood=-49095.42289306763 0.9808641444693966
99: ... loglikelihood=-49051.095083380205 0.9808713077675223
100: ... loglikelihood=-49007.49834809576 0.9808748894165852
model generated
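You can change the number of iterations (presumably the final argument of the generateModel call, 3 in the code above) once you see that the annotated sentences stop changing and the knowns stop changing on subsequent runs as you refine your lists.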
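One way to check for that from a saved console log, rather than by eye, is sketched below. It assumes the log contains lines of the form "annotated sentences: N" and "knowns: N", as in the excerpt above. The class AnnotationCountCheck and its methods are made up for this illustration and are not part of the addon.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls the per-iteration counts out of the console log.
public class AnnotationCountCheck {

    /** Collects the numbers from lines of the form "<label>: <number>", in log order. */
    public static List<Long> extractCounts(List<String> logLines, String label) {
        Pattern countLine = Pattern.compile("^\\s*" + Pattern.quote(label) + ":\\s*(\\d+)\\s*$");
        List<Long> counts = new ArrayList<>();
        for (String line : logLines) {
            Matcher m = countLine.matcher(line);
            if (m.matches()) {
                counts.add(Long.parseLong(m.group(1)));
            }
        }
        return counts;
    }

    /** True when the last two reported counts are identical. */
    public static boolean hasStabilized(List<Long> counts) {
        int n = counts.size();
        return n >= 2 && counts.get(n - 1).equals(counts.get(n - 2));
    }

    public static void main(String[] args) {
        // A few lines copied from the console excerpt above.
        List<String> sample = Arrays.asList(
                "annotated sentences: 7343",
                "Performing NER with new model",
                "annotated sentences: 7369",
                "knowns: 651",
                "ITERATION: 1",
                "knowns: 651",
                "annotated sentences: 20370");
        List<Long> sentences = extractCounts(sample, "annotated sentences");
        List<Long> knowns = extractCounts(sample, "knowns");
        System.out.println("annotated sentences " + sentences + " stabilized=" + hasStabilized(sentences));
        System.out.println("knowns " + knowns + " stabilized=" + hasStabilized(knowns));
    }
}
```

Comparing only the last two values is a deliberately crude rule; you may prefer to require several unchanged runs before lowering the iteration count.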
HTH
Answer 1 (score: 2)
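Unfortunately, you cannot append to a model. But you can use the model to find what it can find, write the hits it finds to a "known entities" file, and write the sentences out to a file. You can then add the other names that you know are not being recognized to the "known entities" file (and add more sentences in which they may occur to the sentences file). Then you can use the OpenNLP addon called modelbuilder-addon to build a new model from the sentences file and the "known entities" file.
See this post for a code example: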
OpenNLP: foreign names does not get recognized
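Since this answer suggests adding sentences that contain the new names, it may help to check which names in the "known entities" file never occur in the sentences file. The sketch below does that with a plain case-sensitive substring test. How the addon itself matches names against sentences is not stated here, so treat this only as a rough pre-check. The class NameCoverageCheck, its method, and the UTF-8 encoding are assumptions for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of OpenNLP or the modelbuilder-addon.
public class NameCoverageCheck {

    /** Returns the trimmed, non-empty names that are not a substring of any sentence. */
    public static List<String> namesWithoutSentences(List<String> names, List<String> sentences) {
        List<String> missing = new ArrayList<>();
        for (String raw : names) {
            String name = raw.trim();
            if (name.isEmpty()) {
                continue;
            }
            boolean found = false;
            for (String sentence : sentences) {
                if (sentence.contains(name)) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: NameCoverageCheck <knownEntitiesFile> <sentencesFile>");
            return;
        }
        // UTF-8 is an assumption; use the encoding your files were written in.
        List<String> names = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        List<String> sentences = Files.readAllLines(Paths.get(args[1]), StandardCharsets.UTF_8);
        for (String name : namesWithoutSentences(names, sentences)) {
            System.out.println("no sentence contains: " + name);
        }
    }
}
```

Names reported by this check are candidates for which you would want to add example sentences before rebuilding the model.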
This is a very new addon, so let me know how it works for you.