Apache OpenNLP

Asked: 2016-11-22 10:12:25

Tags: machine-learning sentiment-analysis opennlp

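I found that Apache OpenNLP has a document categorizer that uses a maximum entropy algorithm and can be used for sentiment analysis.

With the following code, I was able to train a classifier that predicts the sentiment of simple texts.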

import java.io.File;
import java.io.IOException;

import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.doccat.DocumentSampleStream;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

public class OpenNLPCategorizer {
    private DoccatModel model;

    public static void main(String[] args) {
        OpenNLPCategorizer twitterCategorizer = new OpenNLPCategorizer();
        twitterCategorizer.trainModel();
        twitterCategorizer.classifyNewTweet("This movie is good");
    }

    private void trainModel() {
        try {
            // One training document per line: the category label first,
            // followed by the whitespace-separated text.
            InputStreamFactory inputStreamFactory = new MarkableFileInputStreamFactory(
                    new File("datasets/en-sent.csv"));
            ObjectStream<String> lineStream = new PlainTextByLineStream(inputStreamFactory, "UTF-8");
            ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);

            // Trains a maximum entropy model with the default settings.
            model = DocumentCategorizerME.train("en", sampleStream);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private void classifyNewTweet(String tweet) {
        DocumentCategorizerME myCategorizer = new DocumentCategorizerME(model);
        double[] outcomes = myCategorizer.categorize(tweet);
        String category = myCategorizer.getBestCategory(outcomes);

        if (category.equalsIgnoreCase("1")) {
            System.out.println("The tweet is positive :) ");
        } else {
            System.out.println("The tweet is negative :( ");
        }
    }
}

The dataset I used is very simple and contains only 10 lines.
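For reference, DocumentSampleStream reads one document per line, with the category label as the first token and the text after it. A minimal sketch of producing such a file from a corpus stored as one file per document in per-class folders follows. The class name DoccatDatasetBuilder, the folder arguments, and the negative label "0" are assumptions made for illustration; only the positive label "1" comes from the code above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of OpenNLP.
public class DoccatDatasetBuilder {

    /** Returns one "label text" line for every regular file in classDir. */
    static List<String> toLines(Path classDir, String label) throws IOException {
        try (Stream<Path> files = Files.list(classDir)) {
            return files.filter(Files::isRegularFile)
                    .sorted()
                    .map(file -> label + " " + flatten(file))
                    .collect(Collectors.toList());
        }
    }

    /** Collapses a whole document onto a single line. */
    static String flatten(Path file) {
        try {
            String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            return text.replaceAll("\\s+", " ").trim();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes positive documents with label "1" and negative ones with the assumed label "0". */
    static void build(Path posDir, Path negDir, Path out) throws IOException {
        List<String> lines = new ArrayList<>();
        lines.addAll(toLines(posDir, "1"));
        lines.addAll(toLines(negDir, "0"));
        Files.write(out, lines, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 3) {
            System.err.println("usage: DoccatDatasetBuilder <posDir> <negDir> <outFile>");
            return;
        }
        build(Paths.get(args[0]), Paths.get(args[1]), Paths.get(args[2]));
    }
}
```

The resulting file can be passed to MarkableFileInputStreamFactory in place of datasets/en-sent.csv. Flattening each document matters because a line break inside a document would otherwise start a new sample.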

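My questions

1) How do I prepare the dataset here, i.e. polarity v2.0, for training? Or, more generally, how do I prepare a dataset for OpenNLP? I could not find any guide online.

2) How do I verify the accuracy? With Naive Bayes on Apache Spark, the dataset can be split 70:30. How can I split the dataset in that way so that I can test the accuracy?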
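One way to approach the 70:30 split from question 2 is to shuffle the lines of the training file and cut the list, then compare predicted and expected labels on the held-out part. The sketch below assumes the one-document-per-line format with a leading label. The class TrainTestSplit and its methods are hypothetical names, not OpenNLP API, and the classifier is passed in as a plain function so the example stays independent of any OpenNLP version.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.function.Function;

// Hypothetical helper, not part of OpenNLP.
public class TrainTestSplit {

    /** Holds the two parts of a split. */
    static final class Split {
        final List<String> train;
        final List<String> test;

        Split(List<String> train, List<String> test) {
            this.train = train;
            this.test = test;
        }
    }

    /** Shuffles a copy of the lines with a fixed seed and cuts it at trainFraction. */
    static Split split(List<String> lines, double trainFraction, long seed) {
        List<String> shuffled = new ArrayList<>(lines);
        Collections.shuffle(shuffled, new Random(seed));
        int cut = (int) Math.round(shuffled.size() * trainFraction);
        return new Split(
                new ArrayList<>(shuffled.subList(0, cut)),
                new ArrayList<>(shuffled.subList(cut, shuffled.size())));
    }

    /** Fraction of labelled lines for which the classifier returns the leading label. */
    static double accuracy(List<String> testLines, Function<String, String> classifier) {
        int total = 0;
        int correct = 0;
        for (String line : testLines) {
            int space = line.indexOf(' ');
            if (space < 0) {
                continue; // skip lines that have no text after the label
            }
            String expected = line.substring(0, space);
            String text = line.substring(space + 1);
            total++;
            if (expected.equals(classifier.apply(text))) {
                correct++;
            }
        }
        return total == 0 ? Double.NaN : (double) correct / total;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "1 good movie", "0 bad movie", "1 great plot", "0 dull plot",
                "1 fine cast", "0 weak cast", "1 nice score", "0 poor score",
                "1 loved it", "0 hated it");
        Split split = split(lines, 0.7, 42L);
        System.out.println("train size: " + split.train.size());
        System.out.println("test size: " + split.test.size());
        // Placeholder classifier that always answers "1"; a trained model would be used instead.
        System.out.println("accuracy: " + accuracy(split.test, text -> "1"));
    }
}
```

With the categorizer from the question, the function could be text -> myCategorizer.getBestCategory(myCategorizer.categorize(text)), after writing split.train to a file and training on it. OpenNLP also ships a DocumentCategorizerEvaluator class in opennlp.tools.doccat that reports accuracy over a sample stream; its exact usage depends on the library version, so check the Javadoc of the release you are using.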

0 answers:

No answers yet.