Question

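I'm trying to use the OpenNLP implementation of the Maximum Entropy classifier, but the documentation seems sparse. Although the library is obviously designed to be easy to use, I can't find a single example or specification of the input file format (i.e. the training set).

Does anyone know where to find this, or a minimal working training example?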

Answer 1

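The input format in OpenNLP is quite flexible. If you want to use the MaxEnt classifier in OpenNLP, there are several steps you need to follow: turn your data into an ObjectStream of Event objects, set up the TrainingParameters, initialize the trainer, and then train the model.

Here is some sample code with comments: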

package example;

import java.io.File;
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import opennlp.tools.ml.maxent.GISTrainer;
import opennlp.tools.ml.model.Event;
import opennlp.tools.ml.model.MaxentModel;
import opennlp.tools.tokenize.WhitespaceTokenizer;
import opennlp.tools.util.FilterObjectStream;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class ReadData {


    public static void main(String[] args) throws Exception{

        // this is the data file ...
        // the format is <LIST of FEATURES separated by spaces> <outcome>
        // change the file to fit your needs
        File f=new File("football.dat");

        // we need to create an ObjectStream of events for the trainer..
        //   First create an InputStreamFactory -- given a file we can create an InputStream, required for resetting...
        MarkableFileInputStreamFactory factory=new MarkableFileInputStreamFactory(f);
        // create a PlainTextByLineStream -- Note: you can create your own stream that can handle binary files or data that
        //                                --       spans more than one line...
        ObjectStream<String> stream=new PlainTextByLineStream(factory, Charset.defaultCharset());
        //  Now you have a stream of string you need to convert it to a stream of events...
        //  I use a custom FilterObjectStream which simply takes a line, breaks it up into tokens,
        //  uses all except the last as the features [context] and the last token as the outcome class
        ObjectStream<Event> eventStream=new FilterObjectStream<String, Event>(stream) {
            @Override
            public Event read() throws IOException {
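                String line=samples.read();
                // skip blank lines; an empty token array would make Arrays.copyOf below fail
                while (line!=null && line.trim().isEmpty()) line=samples.read();
                if (line==null) return null;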

                String[] parts=WhitespaceTokenizer.INSTANCE.tokenize(line);
                String[] context=Arrays.copyOf(parts, parts.length-1);

                System.out.println(parts[parts.length-1]+" "+Arrays.toString(context));
                return new Event(parts[parts.length-1], context);
            }
        };


        TrainingParameters parameters=new TrainingParameters();
        // By default OpenNLP uses a cutoff of 5 (a feature has to occur 5 times before it is used)
        // use 1 for my small dataset
        parameters.put(GISTrainer.CUTOFF_PARAM, 1);

        GISTrainer trainer=new GISTrainer();
        // the report map is supposed to mark when default values are assigned...
        Map<String,String> reportMap=new HashMap<>();
        // Don't forget to initialize the trainer!
        trainer.init(parameters, reportMap);
        MaxentModel model=trainer.train(eventStream);

        // Now we have a model -- you should test on a test set, but 
        // this is a toy example... so I am just resetting the eventstream.
        eventStream.reset();
        Event evt=null;
        while ( (evt=eventStream.read())!=null ){
            System.out.print(Arrays.toString(evt.getContext())+":  ");
            // Evaluate the context from the event using our model.
            // you would want to calculate summary statistics..
            double[] p=model.eval(evt.getContext());
            System.out.print(model.getBestOutcome(p)+"  ");
            if (model.getBestOutcome(p).equals(evt.getOutcome())){
                System.out.println("CORRECT");
            }else{
                System.out.println("INCORRECT");                
            }
        }

    }

}
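
The comment in the evaluation loop says you would want to calculate summary statistics. Here is a minimal sketch of one way to do that, assuming you collect evt.getOutcome() and model.getBestOutcome(p) into two lists inside the loop. The class AccuracySummary and its methods are hypothetical helpers written for this answer, not part of OpenNLP, and the labels in main are made up for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not an OpenNLP class.
public class AccuracySummary {

    /** Fraction of positions where the predicted outcome equals the gold outcome. */
    public static double accuracy(List<String> gold, List<String> predicted) {
        checkSameSize(gold, predicted);
        if (gold.isEmpty()) {
            return 0.0; // nothing to score
        }
        int correct = 0;
        for (int i = 0; i < gold.size(); i++) {
            if (gold.get(i).equals(predicted.get(i))) {
                correct++;
            }
        }
        return (double) correct / gold.size();
    }

    /** For each gold outcome: {number predicted correctly, number of events}. */
    public static Map<String, int[]> perOutcome(List<String> gold, List<String> predicted) {
        checkSameSize(gold, predicted);
        Map<String, int[]> counts = new LinkedHashMap<>();
        for (int i = 0; i < gold.size(); i++) {
            int[] c = counts.computeIfAbsent(gold.get(i), k -> new int[2]);
            if (gold.get(i).equals(predicted.get(i))) {
                c[0]++;
            }
            c[1]++;
        }
        return counts;
    }

    private static void checkSameSize(List<String> gold, List<String> predicted) {
        if (gold.size() != predicted.size()) {
            throw new IllegalArgumentException("gold and predicted must have the same length");
        }
    }

    public static void main(String[] args) {
        // Made-up labels for illustration only; not output from the model above.
        List<String> gold = Arrays.asList("arsenal", "man_united", "tie", "arsenal");
        List<String> predicted = Arrays.asList("arsenal", "man_united", "arsenal", "arsenal");

        System.out.println("accuracy = " + accuracy(gold, predicted));
        for (Map.Entry<String, int[]> e : perOutcome(gold, predicted).entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue()[0] + "/" + e.getValue()[1]);
        }
    }
}
```

Scoring on the training data, as the toy example does, will tend to look better than the model really is, so you would normally feed this from a held-out test file instead.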

football.dat:

home=man_united Beckham=false Scholes=true Neville=true Henry=true Kanu=true Parlour=false Ferguson=confident Wengler=tense arsenal_lost_previous man_united_won_previous arsenal
home=man_united Beckham=true Scholes=false Neville=true Henry=false Kanu=true Parlour=false Ferguson=tense Wengler=confident arsenal_won_previous man_united_lost_previous man_united
home=man_united Beckham=false Scholes=true Neville=true Henry=true Kanu=true Parlour=false Ferguson=tense Wengler=tense arsenal_lost_previous man_united_won_previous tie
home=man_united Beckham=true Scholes=true Neville=false Henry=true Kanu=false Parlour=false Ferguson=confident Wengler=confident arsenal_won_previous man_united_won_previous tie
home=man_united Beckham=false Scholes=true Neville=true Henry=true Kanu=true Parlour=false Ferguson=confident Wengler=tense arsenal_won_previous man_united_won_previous arsenal
home=man_united Beckham=false Scholes=true Neville=true Henry=false Kanu=true Parlour=false Ferguson=confident Wengler=confident arsenal_won_previous man_united_won_previous man_united
home=man_united Beckham=true Scholes=true Neville=false Henry=true Kanu=true Parlour=false Ferguson=confident Wengler=tense arsenal_won_previous man_united_won_previous man_united
home=arsenal Beckham=false Scholes=true Neville=true Henry=true Kanu=true Parlour=false Ferguson=confident Wengler=tense arsenal_lost_previous man_united_won_previous arsenal
home=arsenal Beckham=true Scholes=false Neville=true Henry=false Kanu=true Parlour=false Ferguson=tense Wengler=confident arsenal_won_previous man_united_lost_previous arsenal
home=arsenal Beckham=false Scholes=true Neville=true Henry=true Kanu=true Parlour=false Ferguson=tense Wengler=tense arsenal_lost_previous man_united_won_previous tie
home=arsenal Beckham=true Scholes=true Neville=false Henry=true Kanu=false Parlour=false Ferguson=confident Wengler=confident arsenal_won_previous man_united_won_previous man_united
home=arsenal Beckham=false Scholes=true Neville=true Henry=true Kanu=true Parlour=false Ferguson=confident Wengler=tense arsenal_won_previous man_united_won_previous arsenal
home=arsenal Beckham=false Scholes=true Neville=true Henry=false Kanu=true Parlour=false Ferguson=confident Wengler=confident arsenal_won_previous man_united_won_previous man_united
home=arsenal Beckham=true Scholes=true Neville=false Henry=true Kanu=true Parlour=false Ferguson=confident Wengler=tense arsenal_won_previous man_united_won_previous arsenal
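
To make the line format concrete: in the code above, every whitespace-separated token except the last is passed through as a context string, and the last token is the outcome. The sketch below mimics that split with plain Java so you can sanity-check a data file before training. It assumes String.split on whitespace is close enough to WhitespaceTokenizer for this data. FormatCheck, Parsed, parseLine and featureCounts are hypothetical names for this answer, not OpenNLP API, and rejecting lines with fewer than two tokens is my own choice. The counts are only a rough guide to which features would survive a given cutoff, going by the description in the code comment above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not an OpenNLP class.
public class FormatCheck {

    /** One parsed line: the feature tokens (context) and the outcome label. */
    public static final class Parsed {
        public final String[] context;
        public final String outcome;

        Parsed(String[] context, String outcome) {
            this.context = context;
            this.outcome = outcome;
        }
    }

    /** Splits a line on whitespace; the last token is the outcome, the rest are features. */
    public static Parsed parseLine(String line) {
        String trimmed = line.trim();
        String[] parts = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        if (parts.length < 2) {
            throw new IllegalArgumentException(
                    "expected at least one feature and an outcome: '" + line + "'");
        }
        return new Parsed(Arrays.copyOf(parts, parts.length - 1), parts[parts.length - 1]);
    }

    /** Counts how many times each feature token appears across all lines. */
    public static Map<String, Integer> featureCounts(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String feature : parseLine(line).context) {
                counts.merge(feature, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Shortened lines in the same shape as football.dat, for illustration.
        List<String> lines = Arrays.asList(
                "home=arsenal Henry=true arsenal_won_previous arsenal",
                "home=arsenal Henry=false arsenal_lost_previous tie");

        for (String line : lines) {
            Parsed p = parseLine(line);
            System.out.println(p.outcome + " <- " + Arrays.toString(p.context));
        }
        System.out.println(featureCounts(lines));
    }
}
```

Note that a token such as home=arsenal is treated as a single opaque string here, exactly like arsenal_lost_previous; nothing in this path splits it at the equals sign.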

Hope it helps.
