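I am working on an aspect-level sentiment analysis project. I am currently implementing the aspect term extraction module, and I am using Stanford NER to train my own custom model on an annotated dataset of 1000 TripAdvisor travel reviews.
I managed to train a custom NER model. The code is as follows: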
import java.util.Properties;

import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.sequences.SeqClassifierFlags;
import edu.stanford.nlp.util.StringUtils;

public class NERTrainer {
    public static void main(String[] args) {
        // Load the training properties (trainFile, map, feature flags).
        String prop = "c:\\Users\\User\\Downloads\\properties.prop";
        Properties props = StringUtils.propFileToProperties(prop);

        // serializeTo is commented out in the properties file, so set it here.
        String modelPath = "c:\\Users\\User\\Desktop\\ner-travel-planner-model.ser.gz";
        props.setProperty("serializeTo", modelPath);

        SeqClassifierFlags flags = new SeqClassifierFlags(props);
        CRFClassifier<CoreLabel> crf = new CRFClassifier<CoreLabel>(flags);

        // Train on trainFile, then write the model to disk.
        crf.train();
        crf.serializeClassifier(modelPath);
    }
}
My properties file (I used the default one provided on the Stanford website):
trainFile = IOB.tsv
#serializeTo = ner-model.ser.gz
map = word=0,answer=1
useClassFeature=true
useWord=true
useNGrams=true
noMidNGrams=true
maxNGramLeng=6
usePrev=true
useNext=true
useSequences=true
usePrevSequences=true
maxLeft=1
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
useDisjunctive=true
The log shows that training completed successfully.
usePrevSequences=true
useClassFeature=true
useTypeSeqs2=true
useSequences=true
wordShape=chris2useLC
useTypeySequences=true
useDisjunctive=true
noMidNGrams=true
serializeTo=c:\Users\User\Desktop\ner-travel-planner-model.ser.gz
maxNGramLeng=6
useNGrams=true
usePrev=true
useNext=true
maxLeft=1
trainFile=IOB.tsv
map=word=0,answer=1
useWord=true
useTypeSeqs=true
numFeatures = 114317
Time to convert docs to feature indices: 2.0 seconds
numClasses: 3 [0=O,1=I-TERM,2=B-TERM]
numDocuments: 2
numDatums: 56513
numFeatures: 114317
Time to convert docs to data/labels: 1.1 seconds
numWeights: 596487
QNMinimizer called on double function of 596487 variables, using M = 25.
An explanation of the output:
Iter The number of iterations
evals The number of function evaluations
SCALING <D> Diagonal scaling was used; <I> Scaled Identity
LINESEARCH [## M steplength] Minpack linesearch
1-Function value was too high
2-Value ok, gradient positive, positive curvature
3-Value ok, gradient negative, positive curvature
4-Value ok, gradient negative, negative curvature
[.. B] Backtracking
VALUE The current function value
TIME Total elapsed time
|GNORM| The current norm of the gradient
{RELNORM} The ratio of the current to initial gradient norms
AVEIMPROVE The average improvement / current value
EVALSCORE The last available eval score
Iter ## evals ## <SCALING> [LINESEARCH] VALUE TIME |GNORM| {RELNORM} AVEIMPROVE EVALSCORE
Iter 1 evals 1 <D> [11M 8.212E-5] 1.714E5 1.06s |1.080E4| {1.082E-1} 0.000E0 -
Iter 2 evals 4 <D> [33131M 6.201E0] 1.204E5 2.78s |8.770E3| {8.784E-2} 2.120E-1 -
Iter 3 evals 10 <D> [1M 2.210E-2] 1.158E5 3.36s |4.819E3| {4.826E-2} 1.603E-1 -
.
.
.
Iter 175 evals 207 <D> [M 1.000E0] 2.132E3 74.42s
QNMinimizer terminated due to average improvement: | newest_val - previous_val | / |newestVal| < TOL
Total time spent in optimization: 74.43s
The classifier file can be found here.
The training data uses IOB notation:
B-TERM - beginning of aspect term label
I-TERM - continuation of aspect term label
O - Default 'not a keyword' label
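To make the labeling scheme concrete, here is a minimal sketch of how a sequence of these labels maps back to aspect terms. It is plain Java and does not use Stanford NER; the class name IobSpans and the method extractTerms are hypothetical names chosen for illustration. It assumes a term starts at B-TERM, continues over any directly following I-TERM labels, and that a stray I-TERM with no preceding B-TERM is ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class IobSpans {

    /** Groups tokens into aspect terms based on B-TERM / I-TERM / O labels. */
    public static List<String> extractTerms(List<String> tokens, List<String> labels) {
        if (tokens.size() != labels.size()) {
            throw new IllegalArgumentException("tokens and labels must have the same length");
        }
        List<String> terms = new ArrayList<>();
        StringBuilder current = null; // the term being built, or null if none is open
        for (int i = 0; i < tokens.size(); i++) {
            String label = labels.get(i);
            if (label.equals("B-TERM")) {
                // A new term starts; close the previous one if it is still open.
                if (current != null) {
                    terms.add(current.toString());
                }
                current = new StringBuilder(tokens.get(i));
            } else if (label.equals("I-TERM") && current != null) {
                // Continuation of the currently open term.
                current.append(' ').append(tokens.get(i));
            } else {
                // O (or an I-TERM with no open term): close any open term.
                if (current != null) {
                    terms.add(current.toString());
                    current = null;
                }
            }
        }
        if (current != null) {
            terms.add(current.toString());
        }
        return terms;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("grab", "yourself", "a", "cold", "beer", "or", "two");
        List<String> labels = Arrays.asList("O", "O", "O", "B-TERM", "I-TERM", "O", "O");
        System.out.println(extractTerms(tokens, labels)); // prints [cold beer]
    }
}
```

Two adjacent B-TERM labels produce two separate terms, which is why the scheme needs a distinct begin label.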
Sample training data:
so O
peaceful B-TERM
interesting B-TERM
and I-TERM
informative I-TERM
it O
had O
been O
raining B-TERM
so O
we O
had O
it's O
still O
a O
place B-TERM
of I-TERM
worship I-TERM
after O
that O
just O
walk B-TERM
down O
to O
jungle O
beach O
and O
grab O
yourself O
a O
cold B-TERM
beer I-TERM
or O
two O
and O
a O
cool O
off O
in O
the O
surf B-TERM
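Because the label column drives everything the CRF learns, a quick sanity check of the training file can be useful. The sketch below is hypothetical helper code and not part of Stanford NER; TrainingFileStats, countLabels and countSequences are illustrative names. It assumes whitespace-separated columns in the order given by map = word=0,answer=1, and it assumes that blank lines separate sequences in the column format.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TrainingFileStats {

    /** Counts how often each label appears in the second column (answer=1). */
    public static Map<String, Integer> countLabels(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // blank line: assumed sequence boundary, carries no label
            }
            String[] cols = trimmed.split("\\s+");
            if (cols.length < 2) {
                continue; // malformed line without a label column
            }
            counts.merge(cols[1], 1, Integer::sum);
        }
        return counts;
    }

    /** Counts runs of non-blank lines, i.e. blank-line-separated sequences. */
    public static int countSequences(List<String> lines) {
        int sequences = 0;
        boolean inSequence = false;
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                inSequence = false;
            } else if (!inSequence) {
                sequences++;
                inSequence = true;
            }
        }
        return sequences;
    }

    public static void main(String[] args) {
        // In-memory stand-in for a few lines of the training file.
        List<String> lines = Arrays.asList(
                "a O", "cold B-TERM", "beer I-TERM", "", "the O", "surf B-TERM");
        System.out.println(countLabels(lines));    // prints {B-TERM=2, I-TERM=1, O=2}
        System.out.println(countSequences(lines)); // prints 2
    }
}
```

For the real file, the lines could be read with Files.readAllLines(Paths.get("IOB.tsv")) and the results compared against the numClasses, numDocuments and numDatums values in the training log.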
But when I try to test the model, it does not seem to work. Every token is tagged as O.
import edu.stanford.nlp.ie.NERClassifierCombiner;
import edu.stanford.nlp.io.IOUtils;
import edu.stanford.nlp.ling.CoreLabel;

import java.io.IOException;
import java.util.List;

public class NERDemo {
    public static void main(String[] args) throws IOException {
        String serializedClassifier = "c:\\Users\\User\\Desktop\\ner-travel-planner-model.ser.gz";
        // String serializedClassifier2 = "/local/stanford-ner-2015-01-30/classifiers/english.muc.7class.distsim.crf.ser.gz";
        if (args.length > 0) {
            serializedClassifier = args[0];
        }

        // Load only the custom model; numeric classifiers and SUTime are disabled.
        NERClassifierCombiner classifier = new NERClassifierCombiner(false, false,
                serializedClassifier);

        String fileContents = IOUtils.slurpFile("c:\\Users\\User\\Desktop\\test-ner.txt");
        List<List<CoreLabel>> out = classifier.classify(fileContents);

        // Print each token as sentence:token with its offsets and predicted tag.
        int i = 0;
        for (List<CoreLabel> lcl : out) {
            i++;
            int j = 0;
            for (CoreLabel cl : lcl) {
                j++;
                System.out.printf("%d:%d: %s%n", i, j,
                        cl.toShorterString("Text", "CharacterOffsetBegin", "CharacterOffsetEnd", "NamedEntityTag"));
            }
        }
    }
}
Output:
Loading classifier from c:\Users\User\Desktop\ner-travel-planner-model.ser.gz ... done [0.4 sec].
1:1: [Text=If CharacterOffsetBegin=0 CharacterOffsetEnd=2 NamedEntityTag=O]
1:2: [Text=you CharacterOffsetBegin=3 CharacterOffsetEnd=6 NamedEntityTag=O]
1:3: [Text=happen CharacterOffsetBegin=7 CharacterOffsetEnd=13 NamedEntityTag=O]
1:4: [Text=to CharacterOffsetBegin=14 CharacterOffsetEnd=16 NamedEntityTag=O]
1:5: [Text=visit CharacterOffsetBegin=17 CharacterOffsetEnd=22 NamedEntityTag=O]
1:6: [Text=Kandy CharacterOffsetBegin=23 CharacterOffsetEnd=28 NamedEntityTag=O]
1:7: [Text=the CharacterOffsetBegin=30 CharacterOffsetEnd=33 NamedEntityTag=O]
1:8: [Text=Tea CharacterOffsetBegin=34 CharacterOffsetEnd=37 NamedEntityTag=O]
1:9: [Text=Museum CharacterOffsetBegin=38 CharacterOffsetEnd=44 NamedEntityTag=O]
1:10: [Text=is CharacterOffsetBegin=45 CharacterOffsetEnd=47 NamedEntityTag=O]
1:11: [Text=a CharacterOffsetBegin=48 CharacterOffsetEnd=49 NamedEntityTag=O]
1:12: [Text=must CharacterOffsetBegin=50 CharacterOffsetEnd=54 NamedEntityTag=O]
1:13: [Text=visit CharacterOffsetBegin=55 CharacterOffsetEnd=60 NamedEntityTag=O]
1:14: [Text=place CharacterOffsetBegin=61 CharacterOffsetEnd=66 NamedEntityTag=O]
1:15: [Text=it CharacterOffsetBegin=68 CharacterOffsetEnd=70 NamedEntityTag=O]
1:16: [Text=is CharacterOffsetBegin=71 CharacterOffsetEnd=73 NamedEntityTag=O]
1:17: [Text=located CharacterOffsetBegin=74 CharacterOffsetEnd=81 NamedEntityTag=O]
1:18: [Text=in CharacterOffsetBegin=82 CharacterOffsetEnd=84 NamedEntityTag=O]
1:19: [Text=a CharacterOffsetBegin=85 CharacterOffsetEnd=86 NamedEntityTag=O]
1:20: [Text=lovely CharacterOffsetBegin=87 CharacterOffsetEnd=93 NamedEntityTag=O]
1:21: [Text=place CharacterOffsetBegin=94 CharacterOffsetEnd=99 NamedEntityTag=O]
1:22: [Text=with CharacterOffsetBegin=100 CharacterOffsetEnd=104 NamedEntityTag=O]
1:23: [Text=a CharacterOffsetBegin=105 CharacterOffsetEnd=106 NamedEntityTag=O]
1:24: [Text=breathtaking CharacterOffsetBegin=107 CharacterOffsetEnd=119 NamedEntityTag=O]
1:25: [Text=view CharacterOffsetBegin=120 CharacterOffsetEnd=124 NamedEntityTag=O]
1:26: [Text=. CharacterOffsetBegin=124 CharacterOffsetEnd=125 NamedEntityTag=O]
2:1: [Text=This CharacterOffsetBegin=126 CharacterOffsetEnd=130 NamedEntityTag=O]
2:2: [Text=place CharacterOffsetBegin=131 CharacterOffsetEnd=136 NamedEntityTag=O]
2:3: [Text=will CharacterOffsetBegin=137 CharacterOffsetEnd=141 NamedEntityTag=O]
2:4: [Text=tell CharacterOffsetBegin=142 CharacterOffsetEnd=146 NamedEntityTag=O]
2:5: [Text=you CharacterOffsetBegin=147 CharacterOffsetEnd=150 NamedEntityTag=O]
2:6: [Text=everything CharacterOffsetBegin=151 CharacterOffsetEnd=161 NamedEntityTag=O]
2:7: [Text=you CharacterOffsetBegin=162 CharacterOffsetEnd=165 NamedEntityTag=O]
2:8: [Text=should CharacterOffsetBegin=166 CharacterOffsetEnd=172 NamedEntityTag=O]
2:9: [Text=know CharacterOffsetBegin=173 CharacterOffsetEnd=177 NamedEntityTag=O]
2:10: [Text=about CharacterOffsetBegin=178 CharacterOffsetEnd=183 NamedEntityTag=O]
2:11: [Text=the CharacterOffsetBegin=184 CharacterOffsetEnd=187 NamedEntityTag=O]
2:12: [Text=history CharacterOffsetBegin=188 CharacterOffsetEnd=195 NamedEntityTag=O]
2:13: [Text=of CharacterOffsetBegin=196 CharacterOffsetEnd=198 NamedEntityTag=O]
2:14: [Text=Tea CharacterOffsetBegin=199 CharacterOffsetEnd=202 NamedEntityTag=O]
2:15: [Text=in CharacterOffsetBegin=203 CharacterOffsetEnd=205 NamedEntityTag=O]
2:16: [Text=Sri CharacterOffsetBegin=206 CharacterOffsetEnd=209 NamedEntityTag=O]
2:17: [Text=Lanka CharacterOffsetBegin=210 CharacterOffsetEnd=215 NamedEntityTag=O]
2:18: [Text=. CharacterOffsetBegin=215 CharacterOffsetEnd=216 NamedEntityTag=O]
3:1: [Text=There CharacterOffsetBegin=217 CharacterOffsetEnd=222 NamedEntityTag=O]
3:2: [Text=are CharacterOffsetBegin=223 CharacterOffsetEnd=226 NamedEntityTag=O]
3:3: [Text=guides CharacterOffsetBegin=227 CharacterOffsetEnd=233 NamedEntityTag=O]
3:4: [Text=in CharacterOffsetBegin=234 CharacterOffsetEnd=236 NamedEntityTag=O]
3:5: [Text=the CharacterOffsetBegin=237 CharacterOffsetEnd=240 NamedEntityTag=O]
3:6: [Text=building CharacterOffsetBegin=241 CharacterOffsetEnd=249 NamedEntityTag=O]
3:7: [Text=who CharacterOffsetBegin=250 CharacterOffsetEnd=253 NamedEntityTag=O]
3:8: [Text=will CharacterOffsetBegin=254 CharacterOffsetEnd=258 NamedEntityTag=O]
3:9: [Text=take CharacterOffsetBegin=259 CharacterOffsetEnd=263 NamedEntityTag=O]
3:10: [Text=you CharacterOffsetBegin=264 CharacterOffsetEnd=267 NamedEntityTag=O]
3:11: [Text=around CharacterOffsetBegin=268 CharacterOffsetEnd=274 NamedEntityTag=O]
3:12: [Text=explaining CharacterOffsetBegin=275 CharacterOffsetEnd=285 NamedEntityTag=O]
3:13: [Text=what CharacterOffsetBegin=286 CharacterOffsetEnd=290 NamedEntityTag=O]
3:14: [Text=they CharacterOffsetBegin=291 CharacterOffsetEnd=295 NamedEntityTag=O]
3:15: [Text=have CharacterOffsetBegin=296 CharacterOffsetEnd=300 NamedEntityTag=O]
3:16: [Text=in CharacterOffsetBegin=301 CharacterOffsetEnd=303 NamedEntityTag=O]
3:17: [Text=each CharacterOffsetBegin=304 CharacterOffsetEnd=308 NamedEntityTag=O]
3:18: [Text=floor CharacterOffsetBegin=309 CharacterOffsetEnd=314 NamedEntityTag=O]
3:19: [Text=. CharacterOffsetBegin=314 CharacterOffsetEnd=315 NamedEntityTag=O]
4:1: [Text=You CharacterOffsetBegin=316 CharacterOffsetEnd=319 NamedEntityTag=O]
4:2: [Text=could CharacterOffsetBegin=320 CharacterOffsetEnd=325 NamedEntityTag=O]
4:3: [Text=enjoy CharacterOffsetBegin=326 CharacterOffsetEnd=331 NamedEntityTag=O]
4:4: [Text=a CharacterOffsetBegin=332 CharacterOffsetEnd=333 NamedEntityTag=O]
4:5: [Text=cup CharacterOffsetBegin=334 CharacterOffsetEnd=337 NamedEntityTag=O]
4:6: [Text=of CharacterOffsetBegin=338 CharacterOffsetEnd=340 NamedEntityTag=O]
4:7: [Text=good CharacterOffsetBegin=341 CharacterOffsetEnd=345 NamedEntityTag=O]
4:8: [Text=tea CharacterOffsetBegin=346 CharacterOffsetEnd=349 NamedEntityTag=O]
4:9: [Text=in CharacterOffsetBegin=350 CharacterOffsetEnd=352 NamedEntityTag=O]
4:10: [Text=the CharacterOffsetBegin=353 CharacterOffsetEnd=356 NamedEntityTag=O]
4:11: [Text=restaurant CharacterOffsetBegin=357 CharacterOffsetEnd=367 NamedEntityTag=O]
4:12: [Text=upstairs CharacterOffsetBegin=368 CharacterOffsetEnd=376 NamedEntityTag=O]
4:13: [Text=but CharacterOffsetBegin=378 CharacterOffsetEnd=381 NamedEntityTag=O]
4:14: [Text=they CharacterOffsetBegin=382 CharacterOffsetEnd=386 NamedEntityTag=O]
4:15: [Text=cant CharacterOffsetBegin=387 CharacterOffsetEnd=391 NamedEntityTag=O]
4:16: [Text=make CharacterOffsetBegin=392 CharacterOffsetEnd=396 NamedEntityTag=O]
4:17: [Text=a CharacterOffsetBegin=397 CharacterOffsetEnd=398 NamedEntityTag=O]
4:18: [Text=proper CharacterOffsetBegin=399 CharacterOffsetEnd=405 NamedEntityTag=O]
4:19: [Text=tea CharacterOffsetBegin=406 CharacterOffsetEnd=409 NamedEntityTag=O]
4:20: [Text=even CharacterOffsetBegin=411 CharacterOffsetEnd=415 NamedEntityTag=O]
4:21: [Text=if CharacterOffsetBegin=416 CharacterOffsetEnd=418 NamedEntityTag=O]
4:22: [Text=it CharacterOffsetBegin=419 CharacterOffsetEnd=421 NamedEntityTag=O]
4:23: [Text=saves CharacterOffsetBegin=422 CharacterOffsetEnd=427 NamedEntityTag=O]
4:24: [Text=their CharacterOffsetBegin=428 CharacterOffsetEnd=433 NamedEntityTag=O]
4:25: [Text=life CharacterOffsetBegin=434 CharacterOffsetEnd=438 NamedEntityTag=O]
4:26: [Text=. CharacterOffsetBegin=438 CharacterOffsetEnd=439 NamedEntityTag=O]
I can't seem to figure out what I am doing wrong. Please help.