Classifying text with Weka after the classifier has been trained

Date: 2016-05-27 15:13:38

Tags: machine-learning weka

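I am a beginner with Weka.

I managed to import a dataset from disk (one folder per category, each folder containing all the texts for that category), apply StringToWordVector with a tokenizer, and train a NaiveBayesMultinomial classifier. The code is below (it is C#, but Java is of course fine too).

However, I can hardly find any information on how to use the classifier in a project. Say I have a text of unknown category, entered by a user: how can I apply the classifier to this text and infer the category it belongs to? (See the "// what to do here" comment in the code below.) Any help would be greatly appreciated ;-)

Thanks in advance

Julien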

    string filepath = @"C:\Users\Julien\Desktop\Meal\";
    ClassificationDatasetHelper classHelper = new ClassificationDatasetHelper();

    // Load one instance per text file; each sub-folder name becomes a class label.
    weka.core.converters.TextDirectoryLoader tdl = new weka.core.converters.TextDirectoryLoader();
    tdl.setDirectory(new java.io.File(filepath));
    tdl.setCharSet("UTF-8");

    weka.core.Instances insts = tdl.getDataSet();

    // Convert the raw text into bigram count features.
    weka.filters.unsupervised.attribute.StringToWordVector swv = new weka.filters.unsupervised.attribute.StringToWordVector();
    swv.setDoNotOperateOnPerClassBasis(false);
    swv.setOutputWordCounts(true);
    swv.setWordsToKeep(1000);
    swv.setIDFTransform(true);
    swv.setMinTermFreq(1);
    swv.setPeriodicPruning(-1);
    weka.core.tokenizers.NGramTokenizer tokenizer = new weka.core.tokenizers.NGramTokenizer();
    tokenizer.setNGramMinSize(2);
    tokenizer.setNGramMaxSize(2);
    swv.setTokenizer(tokenizer);
    // setInputFormat must be the last call before the filter is applied,
    // otherwise the options set above are not taken into account.
    swv.setInputFormat(insts);

    insts = weka.filters.Filter.useFilter(insts, swv);

    // The class attribute is at index 0 in the filtered data.
    insts.setClassIndex(0);

    // Shuffle before splitting: the loader returns instances grouped by class folder.
    insts.randomize(new java.util.Random(1));

    weka.classifiers.Classifier cl = new weka.classifiers.bayes.NaiveBayesMultinomial();
    int percentSplit = 66; // assumed value; not defined in the original snippet
    int trainSize = insts.numInstances() * percentSplit / 100;
    int testSize = insts.numInstances() - trainSize;
    weka.core.Instances train = new weka.core.Instances(insts, 0, trainSize);

    cl.buildClassifier(train);
    string s = "Try to classify this text";
    weka.core.Instance instanceToClassify = new weka.core.Instance();

    // what to do here
    // ???

    double predictedClass = cl.classifyInstance(instanceToClassify);
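
To make the tokenizer settings above concrete: with both the minimum and maximum n-gram size set to 2, every feature is a pair of adjacent words rather than a single word. The following is a minimal plain-Java sketch of that idea, not Weka's implementation. The class name `BigramSketch` and the delimiter set are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

class BigramSketch {
    // Delimiters assumed for this sketch: whitespace and common punctuation.
    private static final String DELIMITERS = "[\\s.,;:'\"()?!]+";

    /** Splits the text into words and returns every pair of adjacent words. */
    static List<String> bigrams(String text) {
        List<String> words = new ArrayList<>();
        for (String w : text.split(DELIMITERS)) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        List<String> result = new ArrayList<>();
        for (int i = 0; i + 1 < words.size(); i++) {
            result.add(words.get(i) + " " + words.get(i + 1));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [Try to, to classify, classify this, this text]
        System.out.println(bigrams("Try to classify this text"));
    }
}
```

A consequence worth keeping in mind: a new text only contributes to the prediction through word pairs that also occurred in the training texts.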

Thanks

1 Answer:

Answer 0 (score: 0)

The best place to learn how to use Weka in a Java application is the official Weka wiki.

https://waikato.github.io/weka-wiki/use_weka_in_your_java_code/

Basically, you provide a new dataset (the classifier ignores the values of the class attribute) and ask it to label each instance for you, like this:

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.FileReader;
    import java.io.FileWriter;
    import weka.core.Instances;
    ...
    // load unlabeled data
    Instances unlabeled = new Instances(
                            new BufferedReader(
                              new FileReader("/some/where/unlabeled.arff")));

    // set class attribute (here: the last attribute)
    unlabeled.setClassIndex(unlabeled.numAttributes() - 1);

    // create copy
    Instances labeled = new Instances(unlabeled);

    // label instances; `tree` is an already trained classifier (like `cl` in the question)
    for (int i = 0; i < unlabeled.numInstances(); i++) {
      double clsLabel = tree.classifyInstance(unlabeled.instance(i));
      labeled.instance(i).setClassValue(clsLabel);
    }

    // save labeled data
    BufferedWriter writer = new BufferedWriter(
                              new FileWriter("/some/where/labeled.arff"));
    writer.write(labeled.toString());
    writer.newLine();
    writer.flush();
    writer.close();
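
One point that matters for text in particular: the new text has to be turned into a vector using the same dictionary that was built during training, and words that were never seen in training are simply dropped. In Weka this is commonly achieved by wrapping the filter and the classifier in `weka.classifiers.meta.FilteredClassifier`, so that new instances pass through the already initialized StringToWordVector. The value returned by `classifyInstance` is the index of the predicted class value, which can be mapped back to its name through the class attribute. The sketch below illustrates the underlying idea in plain Java with a tiny multinomial Naive Bayes; it is not Weka code, and the class `TinyTextClassifier`, the single-word tokenizer, and the example texts and labels are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TinyTextClassifier {
    // word -> column index; fixed once training is done
    private final Map<String, Integer> vocabulary = new HashMap<>();
    private final List<String> labels = new ArrayList<>();
    private int[][] wordCounts; // [label][word column]
    private int[] docCounts;    // documents per label
    private int totalDocs;

    static String[] tokenize(String text) {
        return text.toLowerCase().split("\\W+");
    }

    void train(String[] texts, String[] docLabels) {
        // Pass 1: fix the vocabulary and the list of labels.
        for (int d = 0; d < texts.length; d++) {
            if (!labels.contains(docLabels[d])) {
                labels.add(docLabels[d]);
            }
            for (String w : tokenize(texts[d])) {
                if (!w.isEmpty()) {
                    vocabulary.putIfAbsent(w, vocabulary.size());
                }
            }
        }
        // Pass 2: count words per label using that vocabulary.
        wordCounts = new int[labels.size()][vocabulary.size()];
        docCounts = new int[labels.size()];
        totalDocs = texts.length;
        for (int d = 0; d < texts.length; d++) {
            int c = labels.indexOf(docLabels[d]);
            docCounts[c]++;
            int[] v = vectorize(texts[d]);
            for (int j = 0; j < v.length; j++) {
                wordCounts[c][j] += v[j];
            }
        }
    }

    /** Maps a text onto the training-time vocabulary; unseen words are dropped. */
    int[] vectorize(String text) {
        int[] v = new int[vocabulary.size()];
        for (String w : tokenize(text)) {
            Integer col = vocabulary.get(w);
            if (col != null) {
                v[col]++;
            }
        }
        return v;
    }

    /** Returns the label with the highest log-probability (Laplace smoothing). */
    String classify(String text) {
        int[] v = vectorize(text);
        int best = 0;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int c = 0; c < labels.size(); c++) {
            int total = 0;
            for (int n : wordCounts[c]) {
                total += n;
            }
            double score = Math.log((double) docCounts[c] / totalDocs);
            for (int j = 0; j < v.length; j++) {
                if (v[j] > 0) {
                    double p = (wordCounts[c][j] + 1.0) / (total + v.length);
                    score += v[j] * Math.log(p);
                }
            }
            if (score > bestScore) {
                bestScore = score;
                best = c;
            }
        }
        return labels.get(best);
    }

    public static void main(String[] args) {
        TinyTextClassifier clf = new TinyTextClassifier();
        clf.train(
            new String[] {"grilled chicken with rice", "chocolate cake with cream",
                          "roast chicken and potatoes", "vanilla ice cream"},
            new String[] {"main", "dessert", "main", "dessert"});
        System.out.println(clf.classify("chicken with potatoes")); // prints main
        System.out.println(clf.classify("cream cake"));            // prints dessert
    }
}
```

The important design choice is that `vectorize` is shared by training and prediction, which is the role the trained filter plays in a Weka pipeline.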