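I found that Apache OpenNLP has a document categorizer that uses a maximum entropy algorithm and can be used for sentiment analysis.
With the following code, I was able to train a classifier that predicts the sentiment of simple texts.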
import java.io.File;
import java.io.IOException;

import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.doccat.DocumentSampleStream;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

public class OpenNLPCategorizer {

    // Trained model; stays null if training fails.
    private DoccatModel model;

    public static void main(String[] args) {
        OpenNLPCategorizer twitterCategorizer = new OpenNLPCategorizer();
        twitterCategorizer.trainModel();
        twitterCategorizer.classifyNewTweet("This movie is good");
    }

    // Reads the training file line by line and trains the categorizer on it.
    private void trainModel() {
        InputStreamFactory inputStreamFactory;
        try {
            inputStreamFactory = new MarkableFileInputStreamFactory(
                    new File("datasets/en-sent.csv"));
            ObjectStream<String> lineStream =
                    new PlainTextByLineStream(inputStreamFactory, "UTF-8");
            // Each line becomes one DocumentSample (category plus text).
            ObjectStream<DocumentSample> sampleStream =
                    new DocumentSampleStream(lineStream);
            model = DocumentCategorizerME.train("en", sampleStream);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // Classifies one tweet and prints whether it is positive or negative.
    private void classifyNewTweet(String tweet) {
        DocumentCategorizerME myCategorizer = new DocumentCategorizerME(model);
        // One score per category; the best category is the one with the highest score.
        double[] outcomes = myCategorizer.categorize(tweet);
        String category = myCategorizer.getBestCategory(outcomes);
        // Category "1" is the positive label in the training file.
        if (category.equalsIgnoreCase("1")) {
            System.out.println("The tweet is positive :) ");
        } else {
            System.out.println("The tweet is negative :( ");
        }
    }
}
The dataset I am using is simple and contains 10 lines.
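For context, each line of the training file holds a category label, a space, and then the text, which is the layout `DocumentSampleStream` reads. The sketch below only illustrates that layout in plain Java. The class name `TrainingLineFormat`, the helper `toTrainingLine`, and the sample sentences are made up for illustration and are not my real data.

```java
class TrainingLineFormat {

    // Hypothetical helper: builds one training line as "label<space>text".
    static String toTrainingLine(String label, String text) {
        // Collapse newlines and tabs so the whole document stays on a single line.
        String singleLine = text.replaceAll("\\s+", " ").trim();
        return label + " " + singleLine;
    }

    public static void main(String[] args) {
        // Illustrative sentences only; "1" = positive, "0" = negative.
        System.out.println(toTrainingLine("1", "This movie is good"));
        System.out.println(toTrainingLine("0", "This movie is\nbad"));
    }
}
```

Keeping one document per line matters here because the file is read line by line.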
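My questions:
1) How do I prepare the polarity dataset v2.0 for training? Or, more generally, how do I prepare a dataset for OpenNLP? I could not find any guide online.
2) How do I validate the accuracy? With Naive Bayes on Apache Spark, the dataset can be split 70:30. How can I split the dataset in the same way so that I can test the accuracy?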
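To make the second question concrete, this is roughly the kind of 70:30 split and accuracy check I have in mind, written in plain Java with no OpenNLP calls. The names `TrainTestSplit`, `split`, and `accuracy` are placeholders of my own, not OpenNLP APIs, and the sample lines are invented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class TrainTestSplit {

    // Hypothetical helper: shuffles a copy of the lines with a fixed seed and
    // returns [train, test], where train holds about trainFraction of the lines.
    static List<List<String>> split(List<String> lines, double trainFraction, long seed) {
        List<String> shuffled = new ArrayList<>(lines);
        Collections.shuffle(shuffled, new Random(seed));
        int cut = (int) Math.round(shuffled.size() * trainFraction);
        List<List<String>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(shuffled.subList(0, cut)));
        parts.add(new ArrayList<>(shuffled.subList(cut, shuffled.size())));
        return parts;
    }

    // Accuracy = number of matching labels / total number of labels.
    static double accuracy(List<String> expected, List<String> predicted) {
        if (expected.isEmpty() || expected.size() != predicted.size()) {
            throw new IllegalArgumentException("label lists must be non-empty and equal in size");
        }
        int correct = 0;
        for (int i = 0; i < expected.size(); i++) {
            if (expected.get(i).equals(predicted.get(i))) {
                correct++;
            }
        }
        return (double) correct / expected.size();
    }

    public static void main(String[] args) {
        // Invented training lines in the "label text" layout.
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            lines.add((i % 2) + " sample text " + i);
        }
        List<List<String>> parts = split(lines, 0.7, 42L);
        System.out.println("train=" + parts.get(0).size() + " test=" + parts.get(1).size());

        // Invented labels, only to show how accuracy would be computed.
        double acc = accuracy(Arrays.asList("1", "0", "1", "1"),
                              Arrays.asList("1", "0", "0", "1"));
        System.out.println("accuracy=" + acc);
    }
}
```

The idea would be to train on the first part only, run the classifier on each line of the second part, and compare the predicted category with the label at the start of that line.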