WEKA IBk with EditDistance (Levenshtein distance) gives wrong results - Java

Time: 2018-12-11 16:02:09

Tags: java weka knn

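I am still new to WEKA. Today I tried to use the IBk algorithm with the Levenshtein distance as the distance function to classify strings into different classes, but the results are very poor. Every input is assigned to the same class (class b), which is clearly wrong. Can someone tell me what I am doing wrong?

My code is currently very simple: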

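        import java.io.File;
        import weka.classifiers.Evaluation;
        import weka.classifiers.lazy.IBk;
        import weka.core.EditDistance;
        import weka.core.Instances;
        import weka.core.converters.CSVLoader;

        // Load the CSV file; the last column is the class attribute.
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("current_path"));
        Instances data = loader.getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // 1-nearest-neighbour classifier with EditDistance as the distance function.
        EditDistance newWeka = new EditDistance();
        IBk ibk = new IBk(1);
        ibk.getNearestNeighbourSearchAlgorithm().setDistanceFunction(newWeka);
        ibk.setCrossValidate(false);
        ibk.setMeanSquared(false);
        ibk.buildClassifier(data);

        System.out.println(ibk);

        // Evaluate on the training data itself.
        Evaluation eval = new Evaluation(data);
        eval.evaluateModel(ibk, data);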

Result:

** KNN Demo  **

Correctly Classified Instances           4               50      %
Incorrectly Classified Instances         4               50      %
Kappa statistic                          0     
Mean absolute error                      0.398 
Root mean squared error                  0.4449
Relative absolute error                 97.2913 %
Root relative squared error             99.5586 %
Total Number of Instances                8     

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0,000    0,000    ?          0,000    ?          ?        0,500     0,375     Surname
                 1,000    1,000    0,500      1,000    0,667      ?        0,500     0,500     Firstname
                 0,000    0,000    ?          0,000    ?          ?        0,500     0,125     Job
Weighted Avg.    0,500    0,500    ?          0,500    ?          ?        0,500     0,406     

=== Confusion Matrix ===

 a b c   <-- classified as
 0 3 0 | a = Surname
 0 4 0 | b = Firstname
 0 1 0 | c = Job

File:

"Attribute","class"
"Wellbrock","Surname"
"Kohler","Surname"
"Sanger","Surname"
"Jan","Firstname"
"Anna","Firstname"
"Tim","Firstname"
"Schmidt","Firstname"
"Consultant","Job"

Thank you very much for your help.

1 Answer:

Answer 0: (score: 0)

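I found the solution myself. The problem is that, with the Java API, the default search algorithm seems to be ZeroR, which always assigns every instance to the most frequent class.

I have now added this line to my code, and the results are as expected:

        ibk.setNearestNeighbourSearchAlgorithm(new LinearNNSearch());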

=== Confusion Matrix ===

 a b c   <-- classified as
 3 0 0 | a = Surname
 0 4 0 | b = Firstname
 0 0 6 | c = Job
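
As a cross-check that does not depend on WEKA, here is a minimal sketch of what 1-nearest-neighbour classification with the Levenshtein distance would be expected to do on the eight rows from the question. The class name `LevenshteinNearestNeighbour`, the `TRAINING` array, and the query strings are hypothetical and chosen only for illustration. This is a plain-Java approximation of the idea, not WEKA's `EditDistance` or `IBk` implementation, and WEKA may normalise distances or break ties differently.

```java
public class LevenshteinNearestNeighbour {

    // The eight rows from the question's CSV file: {text, class label}.
    static final String[][] TRAINING = {
        {"Wellbrock", "Surname"}, {"Kohler", "Surname"}, {"Sanger", "Surname"},
        {"Jan", "Firstname"}, {"Anna", "Firstname"}, {"Tim", "Firstname"},
        {"Schmidt", "Firstname"}, {"Consultant", "Job"}
    };

    /** Dynamic-programming Levenshtein distance using two rolling rows. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1], prev[j]) + 1;
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            // Swap rows so that prev always holds the last completed row.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Returns the label of the closest training string; the first one wins on ties. */
    static String classify(String[][] training, String query) {
        String bestLabel = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String[] row : training) {
            int d = levenshtein(row[0], query);
            if (d < bestDistance) {
                bestDistance = d;
                bestLabel = row[1];
            }
        }
        return bestLabel;
    }

    public static void main(String[] args) {
        // Hypothetical unseen inputs, not taken from the question.
        for (String query : new String[] {"Kohl", "Tom", "Consultants"}) {
            System.out.println(query + " -> " + classify(TRAINING, query));
        }
    }
}
```

If a classifier configured with an edit distance predicts the same class for every input, comparing its predictions with a small reference like this can help show whether the distance function is actually being applied.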