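I am trying to use the OpenNLP implementation of the Maximum Entropy classifier, but the documentation seems to be lacking. Although the library was evidently designed to be easy to use, I cannot find a single example and/or specification of the input file format (i.e. the training set).
Does anyone know where to find this, or a minimal working training example?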
Answer 0 (score: 3)
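The format for OpenNLP is pretty flexible: the trainer consumes a stream of Event objects (an outcome plus an array of feature strings), so you can read your training data from whatever file layout suits you. If you want to use the MaxEnt classifier in OpenNLP, there are several steps required.
Here is the annotated sample code: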
package example;

import java.io.File;
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import opennlp.tools.ml.maxent.GISTrainer;
import opennlp.tools.ml.model.Event;
import opennlp.tools.ml.model.MaxentModel;
import opennlp.tools.tokenize.WhitespaceTokenizer;
import opennlp.tools.util.FilterObjectStream;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class ReadData {

    public static void main(String[] args) throws Exception {
        // This is the data file.
        // Each line has the format: <list of features separated by spaces> <outcome>
        // Change the file to fit your needs.
        File f = new File("football.dat");

        // We need to create an ObjectStream of events for the trainer.
        // First create an InputStreamFactory: given a file it can create a new
        // InputStream, which is required for resetting the stream.
        MarkableFileInputStreamFactory factory = new MarkableFileInputStreamFactory(f);

        // Create a PlainTextByLineStream. Note: you can write your own stream to
        // handle binary files or records that span more than one line.
        ObjectStream<String> stream = new PlainTextByLineStream(factory, Charset.defaultCharset());

        // Now that you have a stream of strings, convert it to a stream of events.
        // I use a custom FilterObjectStream which simply takes a line, breaks it
        // up into tokens, uses all except the last as the features (the context)
        // and the last token as the outcome class.
        ObjectStream<Event> eventStream = new FilterObjectStream<String, Event>(stream) {
            @Override
            public Event read() throws IOException {
                String line = samples.read();
                if (line == null) return null;
                String[] parts = WhitespaceTokenizer.INSTANCE.tokenize(line);
                String[] context = Arrays.copyOf(parts, parts.length - 1);
                System.out.println(parts[parts.length - 1] + " " + Arrays.toString(context));
                return new Event(parts[parts.length - 1], context);
            }
        };

        TrainingParameters parameters = new TrainingParameters();
        // By default OpenNLP uses a cutoff of 5 (a feature has to occur 5 times
        // before it is used). Use 1 for my small dataset.
        parameters.put(GISTrainer.CUTOFF_PARAM, 1);

        GISTrainer trainer = new GISTrainer();
        // The report map is supposed to record when default values are assigned.
        Map<String, String> reportMap = new HashMap<>();
        // Don't forget to initialize the trainer!
        trainer.init(parameters, reportMap);
        MaxentModel model = trainer.train(eventStream);

        // Now we have a model. You should test on a separate test set, but this
        // is a toy example, so I am just resetting the event stream.
        eventStream.reset();
        Event evt = null;
        while ((evt = eventStream.read()) != null) {
            System.out.print(Arrays.toString(evt.getContext()) + ": ");
            // Evaluate the context from the event using our model.
            // In practice you would want to calculate summary statistics.
            double[] p = model.eval(evt.getContext());
            System.out.print(model.getBestOutcome(p) + " ");
            if (model.getBestOutcome(p).equals(evt.getOutcome())) {
                System.out.println("CORRECT");
            } else {
                System.out.println("INCORRECT");
            }
        }
    }
}
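The evaluation loop above only prints CORRECT or INCORRECT per line. As a minimal sketch of the "summary statistics" the comment mentions, the class below tallies overall accuracy and per-outcome hits. `AccuracyTally` and its methods are hypothetical names made up for this illustration; they are not part of OpenNLP, and the class uses only the Java standard library.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of OpenNLP: collects expected/predicted
// pairs and reports overall accuracy plus per-outcome counts.
class AccuracyTally {
    private int total = 0;
    private int correct = 0;
    // expected outcome -> {times seen, times predicted correctly}
    private final Map<String, int[]> perOutcome = new TreeMap<>();

    // Record one evaluated event.
    void record(String expected, String predicted) {
        int[] counts = perOutcome.computeIfAbsent(expected, k -> new int[2]);
        counts[0]++;
        total++;
        if (expected.equals(predicted)) {
            counts[1]++;
            correct++;
        }
    }

    // Fraction of recorded events that were predicted correctly.
    double accuracy() {
        return total == 0 ? 0.0 : (double) correct / total;
    }

    // Number of events with this expected outcome that were predicted correctly.
    int correctFor(String outcome) {
        int[] counts = perOutcome.get(outcome);
        return counts == null ? 0 : counts[1];
    }

    // Number of events recorded with this expected outcome.
    int seenFor(String outcome) {
        int[] counts = perOutcome.get(outcome);
        return counts == null ? 0 : counts[0];
    }

    public static void main(String[] args) {
        AccuracyTally tally = new AccuracyTally();
        tally.record("arsenal", "arsenal");
        tally.record("tie", "arsenal");
        tally.record("man_united", "man_united");
        System.out.println("accuracy = " + tally.accuracy());
    }
}
```

Inside the loop in ReadData, a call such as tally.record(evt.getOutcome(), model.getBestOutcome(p)) would feed it. Keep in mind that the toy example evaluates on its own training data, so the resulting figure will be optimistic.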
football.dat:
home=man_united Beckham=false Scholes=true Neville=true Henry=true Kanu=true Parlour=false Ferguson=confident Wengler=tense arsenal_lost_previous man_united_won_previous arsenal
home=man_united Beckham=true Scholes=false Neville=true Henry=false Kanu=true Parlour=false Ferguson=tense Wengler=confident arsenal_won_previous man_united_lost_previous man_united
home=man_united Beckham=false Scholes=true Neville=true Henry=true Kanu=true Parlour=false Ferguson=tense Wengler=tense arsenal_lost_previous man_united_won_previous tie
home=man_united Beckham=true Scholes=true Neville=false Henry=true Kanu=false Parlour=false Ferguson=confident Wengler=confident arsenal_won_previous man_united_won_previous tie
home=man_united Beckham=false Scholes=true Neville=true Henry=true Kanu=true Parlour=false Ferguson=confident Wengler=tense arsenal_won_previous man_united_won_previous arsenal
home=man_united Beckham=false Scholes=true Neville=true Henry=false Kanu=true Parlour=false Ferguson=confident Wengler=confident arsenal_won_previous man_united_won_previous man_united
home=man_united Beckham=true Scholes=true Neville=false Henry=true Kanu=true Parlour=false Ferguson=confident Wengler=tense arsenal_won_previous man_united_won_previous man_united
home=arsenal Beckham=false Scholes=true Neville=true Henry=true Kanu=true Parlour=false Ferguson=confident Wengler=tense arsenal_lost_previous man_united_won_previous arsenal
home=arsenal Beckham=true Scholes=false Neville=true Henry=false Kanu=true Parlour=false Ferguson=tense Wengler=confident arsenal_won_previous man_united_lost_previous arsenal
home=arsenal Beckham=false Scholes=true Neville=true Henry=true Kanu=true Parlour=false Ferguson=tense Wengler=tense arsenal_lost_previous man_united_won_previous tie
home=arsenal Beckham=true Scholes=true Neville=false Henry=true Kanu=false Parlour=false Ferguson=confident Wengler=confident arsenal_won_previous man_united_won_previous man_united
home=arsenal Beckham=false Scholes=true Neville=true Henry=true Kanu=true Parlour=false Ferguson=confident Wengler=tense arsenal_won_previous man_united_won_previous arsenal
home=arsenal Beckham=false Scholes=true Neville=true Henry=false Kanu=true Parlour=false Ferguson=confident Wengler=confident arsenal_won_previous man_united_won_previous man_united
home=arsenal Beckham=true Scholes=true Neville=false Henry=true Kanu=true Parlour=false Ferguson=confident Wengler=tense arsenal_won_previous man_united_won_previous arsenal
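To make the line format concrete (every token except the last is a feature, and the last token is the outcome), here is a small sketch of the same split written with only the Java standard library. `EventLine` is a hypothetical name used for illustration and is not an OpenNLP class. Unlike the read() method in ReadData, it returns null for a blank line, on the assumption that a data file may end with an empty line that should be skipped.

```java
import java.util.Arrays;

// Hypothetical helper, not part of OpenNLP: splits one training line into
// its feature tokens (the context) and its final token (the outcome).
class EventLine {
    final String outcome;
    final String[] context;

    EventLine(String outcome, String[] context) {
        this.outcome = outcome;
        this.context = context;
    }

    // Returns null for blank lines so the caller can skip them.
    static EventLine parse(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return null;
        }
        String[] parts = trimmed.split("\\s+");
        // Everything before the last token is a feature.
        String[] context = Arrays.copyOf(parts, parts.length - 1);
        // The last token is the outcome label.
        return new EventLine(parts[parts.length - 1], context);
    }

    public static void main(String[] args) {
        EventLine e = parse("home=arsenal Henry=true arsenal_won_previous arsenal");
        System.out.println(e.outcome + " " + Arrays.toString(e.context));
    }
}
```

A feature can be any whitespace-free string: key=value pairs such as Henry=true and bare flags such as arsenal_won_previous are both treated as opaque tokens.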
Hope this helps.