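I am still new to WEKA. Today I tried to use the IBk algorithm with Levenshtein distance as the distance function to classify strings into different classes, but the results are very poor: every input is assigned to the same class (class b), which is clearly wrong. Can anyone tell me what I am doing wrong?
At the moment my code is very simple: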
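import java.io.File;
import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IBk;
import weka.core.EditDistance; // assuming WEKA's built-in edit-distance class
import weka.core.Instances;
import weka.core.converters.CSVLoader;

// Load the CSV file; the last column is the class attribute.
CSVLoader loader = new CSVLoader();
loader.setSource(new File("current_path"));
Instances data = loader.getDataSet();
data.setClassIndex(data.numAttributes() - 1);

// 1-nearest-neighbour classifier using Levenshtein (edit) distance.
EditDistance newWeka = new EditDistance();
IBk ibk = new IBk(1);
ibk.getNearestNeighbourSearchAlgorithm().setDistanceFunction(newWeka);
ibk.setCrossValidate(false);
ibk.setMeanSquared(false);
ibk.buildClassifier(data);
System.out.println(ibk);

// Evaluate on the training data itself.
Evaluation eval = new Evaluation(data);
eval.evaluateModel(ibk, data);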
Result:
** KNN Demo **
Correctly Classified Instances 4 50 %
Incorrectly Classified Instances 4 50 %
Kappa statistic 0
Mean absolute error 0.398
Root mean squared error 0.4449
Relative absolute error 97.2913 %
Root relative squared error 99.5586 %
Total Number of Instances 8
=== Detailed Accuracy By Class ===
                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0,000    0,000    ?          0,000    ?          ?        0,500     0,375     Surname
                 1,000    1,000    0,500      1,000    0,667      ?        0,500     0,500     Firstname
                 0,000    0,000    ?          0,000    ?          ?        0,500     0,125     Job
Weighted Avg.    0,500    0,500    ?          0,500    ?          ?        0,500     0,406
=== Confusion Matrix ===
a b c <-- classified as
0 3 0 | a = Surname
0 4 0 | b = Firstname
0 1 0 | c = Job
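To double-check how the summary figures follow from this confusion matrix, here is a minimal plain-Java sketch that recomputes the accuracy and Cohen's kappa. It does not use WEKA; the class name `ConfusionMatrixStats` and its methods are made up for illustration, and it assumes a square matrix with rows as actual classes and columns as predicted classes.

```java
// Illustrative helper, not part of WEKA: recomputes accuracy and
// Cohen's kappa from a square confusion matrix
// (rows = actual class, columns = predicted class).
final class ConfusionMatrixStats {

    /** Observed agreement: share of instances on the diagonal. */
    static double accuracy(int[][] m) {
        long total = 0;
        long correct = 0;
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < m[i].length; j++) {
                total += m[i][j];
                if (i == j) {
                    correct += m[i][j];
                }
            }
        }
        return (double) correct / total;
    }

    /** Cohen's kappa: (po - pe) / (1 - pe), pe = chance agreement. */
    static double kappa(int[][] m) {
        int n = m.length;
        long total = 0;
        long[] rowSum = new long[n];
        long[] colSum = new long[n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                rowSum[i] += m[i][j];
                colSum[j] += m[i][j];
                total += m[i][j];
            }
        }
        double po = accuracy(m);
        double pe = 0.0;
        for (int i = 0; i < n; i++) {
            pe += (double) rowSum[i] * colSum[i] / ((double) total * total);
        }
        return (po - pe) / (1.0 - pe);
    }

    public static void main(String[] args) {
        // The confusion matrix shown above: everything predicted as class b.
        int[][] matrix = {
            {0, 3, 0},
            {0, 4, 0},
            {0, 1, 0}
        };
        System.out.println("accuracy = " + accuracy(matrix)); // prints 0.5
        System.out.println("kappa    = " + kappa(matrix));    // prints 0.0
    }
}
```

With every instance predicted as class b, the observed agreement (4/8) equals the agreement expected by chance (also 4/8), which is why the kappa statistic is 0 even though half of the instances count as correct.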
File:
"Attribute","class"
"Wellbrock","Surname"
"Kohler","Surname"
"Sanger","Surname"
"Jan","Firstname"
"Anna","Firstname"
"Tim","Firstname"
"Schmidt","Firstname"
"Consultant","Job"
Thank you very much for your help.
Answer 0 (score: 0)
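I found the solution myself. The problem is that, with the Java API, the default search algorithm seems to be ZeroR, which always assigns every instance to the most frequent class.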
I have now added this line to the code, and the results are as expected: ibk.setNearestNeighbourSearchAlgorithm(new LinearNNSearch());
=== Confusion Matrix ===
a b c <-- classified as
3 0 0 | a = Surname
0 4 0 | b = Firstname
0 0 6 | c = Job
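For comparison, here is a rough plain-Java sketch of what a linear 1-nearest-neighbour search with Levenshtein distance is expected to do. It is not WEKA's implementation: the class `EditDistanceNearestNeighbour`, its methods, and the query strings are hypothetical, and the training data is the eight-row file from the question.

```java
// Illustrative sketch, not WEKA code: 1-nearest-neighbour classification
// of strings by linear search with Levenshtein (edit) distance.
final class EditDistanceNearestNeighbour {

    // Training data copied from the CSV file in the question.
    static final String[] VALUES = {
        "Wellbrock", "Kohler", "Sanger", "Jan", "Anna", "Tim", "Schmidt", "Consultant"
    };
    static final String[] LABELS = {
        "Surname", "Surname", "Surname", "Firstname", "Firstname", "Firstname", "Firstname", "Job"
    };

    /** Dynamic-programming Levenshtein distance with unit edit costs. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Scans all training strings; ties go to the earliest entry. */
    static String classify(String[] values, String[] labels, String query) {
        int best = Integer.MAX_VALUE;
        String bestLabel = null;
        for (int i = 0; i < values.length; i++) {
            int d = levenshtein(values[i], query);
            if (d < best) {
                best = d;
                bestLabel = labels[i];
            }
        }
        return bestLabel;
    }

    public static void main(String[] args) {
        // "Kohl" is a made-up query; its nearest neighbour is "Kohler".
        System.out.println(classify(VALUES, LABELS, "Kohl")); // prints Surname
    }
}
```

Note that when a 1-nearest-neighbour model is evaluated on its own training data, each string finds itself at distance 0, so a perfectly diagonal confusion matrix is the expected outcome there; cross-validation would probably give a more informative estimate.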