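I am a beginner with Weka.
I managed to import a dataset from disk (one folder per category, with all the texts for that category inside the folder), apply StringToWordVector with a tokenizer, and train a NaiveBayesMultinomial classifier. The code is below (it is C#, but Java is fine of course).
However, I can hardly find any information on how to use the classifier in a project. Say I have a text of unknown category, entered by a user: how can I apply the classifier to this text and infer which category it belongs to? (See "// what to do here" in the code below.) Any help would be greatly appreciated ;-)
Thanks in advance,
Julien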
string filepath = @"C:\Users\Julien\Desktop\Meal\";
ClassificationDatasetHelper classHelper = new ClassificationDatasetHelper();
weka.core.converters.TextDirectoryLoader tdl = new weka.core.converters.TextDirectoryLoader();
tdl.setDirectory(new java.io.File(filepath));
tdl.setCharSet("UTF-8");
weka.core.Instances insts = tdl.getDataSet();
weka.filters.unsupervised.attribute.StringToWordVector swv = new weka.filters.unsupervised.attribute.StringToWordVector();
swv.setDoNotOperateOnPerClassBasis(false);
swv.setOutputWordCounts(true);
swv.setWordsToKeep(1000);
swv.setIDFTransform(true);
swv.setMinTermFreq(1);
swv.setPeriodicPruning(-1);
// Tokenize into bigrams (two-word sequences) only
weka.core.tokenizers.NGramTokenizer tokenizer = new weka.core.tokenizers.NGramTokenizer();
tokenizer.setNGramMinSize(2);
tokenizer.setNGramMaxSize(2);
swv.setTokenizer(tokenizer);
// Weka expects setInputFormat to be the last call before the filter is applied, after all options are set
swv.setInputFormat(insts);
// Apply the filter to turn the raw text into word-count attributes
insts = weka.filters.Filter.useFilter(insts, swv);
insts.setClassIndex(0);
weka.classifiers.Classifier cl = new weka.classifiers.bayes.NaiveBayesMultinomial();
// percentSplit is not defined in this snippet
int trainSize = insts.numInstances() * percentSplit / 100;
int testSize = insts.numInstances() - trainSize;
// Train on the first trainSize instances
weka.core.Instances train = new weka.core.Instances(insts, 0, trainSize);
cl.buildClassifier(train);
string s = "Try to classify this text";
weka.core.Instance instanceToClassify = new weka.core.Instance();
// what to do here
// ???
double predictedClass = cl.classifyInstance(instanceToClassify);
Thanks
Answer 0 (score: 0)
The best place to learn how to use Weka in a Java application is the official Weka wiki:
https://waikato.github.io/weka-wiki/use_weka_in_your_java_code/
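Basically, you provide a new dataset with the same attributes as the training data (for raw text, that means passing it through the same trained StringToWordVector filter first). The classifier ignores the values of the class attribute, and you ask it to label each instance, like this: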
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import weka.core.Instances;
...
// load unlabeled data
Instances unlabeled = new Instances(
    new BufferedReader(
        new FileReader("/some/where/unlabeled.arff")));
// set class attribute
unlabeled.setClassIndex(unlabeled.numAttributes() - 1);
// create copy
Instances labeled = new Instances(unlabeled);
// label instances
// (tree is a classifier that was trained earlier; it is not defined in this snippet)
for (int i = 0; i < unlabeled.numInstances(); i++) {
    double clsLabel = tree.classifyInstance(unlabeled.instance(i));
    labeled.instance(i).setClassValue(clsLabel);
}
// save labeled data
BufferedWriter writer = new BufferedWriter(
    new FileWriter("/some/where/labeled.arff"));
writer.write(labeled.toString());
writer.newLine();
writer.flush();
writer.close();
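A classifier trained on word-count attributes can only score an instance that has those same attributes, so the user's text has to be vectorized against the dictionary learned at training time, not a fresh one. The plain-Java sketch below (no Weka calls) illustrates that step only. `BigramVectorizer`, `fit` and `transform` are hypothetical names made up for this illustration, and splitting on whitespace is a simplification of what Weka's NGramTokenizer does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; this is not part of Weka.
final class BigramVectorizer {
    // Maps each bigram seen during training to its column index.
    private final Map<String, Integer> vocabulary = new LinkedHashMap<>();

    // Splits text on whitespace and joins adjacent words into bigrams.
    public static List<String> bigrams(String text) {
        String[] words = text.trim().split("\\s+");
        List<String> result = new ArrayList<>();
        for (int i = 0; i + 1 < words.length; i++) {
            result.add(words[i] + " " + words[i + 1]);
        }
        return result;
    }

    // Learns the vocabulary from the training texts only.
    public void fit(List<String> trainingTexts) {
        for (String text : trainingTexts) {
            for (String gram : bigrams(text)) {
                vocabulary.putIfAbsent(gram, vocabulary.size());
            }
        }
    }

    // Counts the bigrams of a new text against the fixed vocabulary.
    // Bigrams that were never seen during training are dropped.
    public int[] transform(String text) {
        int[] counts = new int[vocabulary.size()];
        for (String gram : bigrams(text)) {
            Integer column = vocabulary.get(gram);
            if (column != null) {
                counts[column]++;
            }
        }
        return counts;
    }

    public int size() {
        return vocabulary.size();
    }

    public static void main(String[] args) {
        BigramVectorizer vectorizer = new BigramVectorizer();
        vectorizer.fit(Arrays.asList(
            "pasta with tomato sauce",
            "grilled chicken with rice"));
        // The new text gets the same six columns as the training data.
        int[] vector = vectorizer.transform("chicken with tomato sauce");
        System.out.println(Arrays.toString(vector)); // prints [0, 1, 1, 0, 1, 0]
    }
}
```

In Weka itself, the equivalent is to push the new instance through the StringToWordVector that was used for training rather than a newly created one. Wrapping the filter and the classifier in weka.classifiers.meta.FilteredClassifier is one way to have that done automatically at prediction time.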