I get the following output from a Mahout classification run:
=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances : 3948 93,4217%
Incorrectly Classified Instances : 278 6,5783%
Total Classified Instances : 4226
=======================================================
Confusion Matrix
-------------------------------------------------------
a b <--Classified as
3747 263 | 4010 a = NOT_Science fiction
15 201 | 216 b = Science fiction
=======================================================
Statistics
-------------------------------------------------------
Kappa 0,5594
Accuracy 93,4217%
Reliability 62,1657%
Reliability (standard deviation) 0,5384
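How does Mahout calculate the reliability?
According to https://issues.apache.org/jira/browse/MAHOUT-941, it should be the user's accuracy. As I understand user's accuracy, it is computed per column as the number of correctly classified instances divided by the total number of instances classified as that category (http://spatial-analyst.net/ILWIS/htm/ilwismen/confusion_matrix.htm).
So far I have not been able to figure out how the 62,1657% is calculated.
If I compute the average over the rows (one row per class), I get: ((3747/4010) + (201/216)) / 2 = 0.932 -> 93.2%
If I compute the average over the columns, I get: ((3747/3762) + (201/464)) / 2 = 0.715 -> 71.5%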
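For reference, the two averages above can be reproduced with a small standalone sketch. It does not use Mahout at all; the class name `AccuracyAverages` and its methods are made up for illustration, and the matrix is copied by hand from the report.

```java
import java.util.Locale;

/** Illustrative helper (not part of Mahout): averages per-row and per-column accuracy. */
final class AccuracyAverages {

    /** Mean over all classes of diagonal / row total. */
    static double meanRowAccuracy(int[][] m) {
        double sum = 0;
        for (int i = 0; i < m.length; i++) {
            double rowTotal = 0;
            for (int j = 0; j < m.length; j++) {
                rowTotal += m[i][j];
            }
            sum += m[i][i] / rowTotal;
        }
        return sum / m.length;
    }

    /** Mean over all classes of diagonal / column total. */
    static double meanColumnAccuracy(int[][] m) {
        double sum = 0;
        for (int j = 0; j < m.length; j++) {
            double columnTotal = 0;
            for (int i = 0; i < m.length; i++) {
                columnTotal += m[i][j];
            }
            sum += m[j][j] / columnTotal;
        }
        return sum / m.length;
    }

    public static void main(String[] args) {
        // Rows: NOT_Science fiction, Science fiction (values from the report above).
        int[][] matrix = {{3747, 263}, {15, 201}};
        System.out.printf(Locale.ROOT, "rows: %.3f, columns: %.3f%n",
                meanRowAccuracy(matrix), meanColumnAccuracy(matrix));
    }
}
```

Neither value is close to the reported 62,1657%, so the reported figure has to come from something other than a plain average over these two classes.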
Answer 0 (score: 0)
Reliability is the user's accuracy. It is not calculated correctly in the current version (0.9):
    public RunningAverageAndStdDev getNormalizedStats() {
        RunningAverageAndStdDev summer = new FullRunningAverageAndStdDev();
        for (int d = 0; d < confusionMatrix.length; d++) {
            double total = 0;
            for (int j = 0; j < confusionMatrix.length; j++) {
                total += confusionMatrix[d][j];
            }
            summer.addDatum(confusionMatrix[d][d] / (total + 0.000001));
        }
        return summer;
    }
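The problem is that the confusion matrix contains all labels plus an additional "DEFAULT" label. The "DEFAULT" label appears to be meant for unclassified instances. If you have no unclassified instances, it still enters the average and distorts the result. Adding a check that skips the "DEFAULT" label worked for me: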
    public RunningAverageAndStdDev getNormalizedStats() {
        RunningAverageAndStdDev summer = new FullRunningAverageAndStdDev();
        for (int d = 0; d < confusionMatrix.length; d++) {
            // Do not add the "DEFAULT" label to the calculation
            if (labelMap.get(defaultLabel) == d) {
                continue;
            }
            double total = 0;
            for (int j = 0; j < confusionMatrix.length; j++) {
                total += confusionMatrix[d][j];
            }
            summer.addDatum(confusionMatrix[d][d] / (total + 0.000001));
        }
        return summer;
    }
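
As a sanity check, the numbers in the question are consistent with this explanation. The sketch below is standalone and does not call Mahout; `ReliabilityCheck` is a made-up name, the all-zero third row standing in for "DEFAULT" is an assumption, and so is the sample (n - 1) standard deviation, which I chose only because it matches the reported value.

```java
import java.util.Locale;

/** Illustrative stand-in for the loop in getNormalizedStats(); not Mahout code. */
final class ReliabilityCheck {

    /** diagonal / (row total + epsilon) for every row, as in the method above. */
    static double[] rowAccuracies(int[][] m) {
        double[] acc = new double[m.length];
        for (int d = 0; d < m.length; d++) {
            double total = 0;
            for (int j = 0; j < m.length; j++) {
                total += m[d][j];
            }
            acc[d] = m[d][d] / (total + 0.000001);
        }
        return acc;
    }

    static double mean(double[] v) {
        double sum = 0;
        for (double x : v) {
            sum += x;
        }
        return sum / v.length;
    }

    /** Sample standard deviation (divides by n - 1); an assumption, see the text. */
    static double sampleStdDev(double[] v) {
        double mu = mean(v);
        double squares = 0;
        for (double x : v) {
            squares += (x - mu) * (x - mu);
        }
        return Math.sqrt(squares / (v.length - 1));
    }

    public static void main(String[] args) {
        // Assumed layout: the two real labels plus an empty "DEFAULT" row and column.
        int[][] withDefault = {{3747, 263, 0}, {15, 201, 0}, {0, 0, 0}};
        int[][] withoutDefault = {{3747, 263}, {15, 201}};

        double[] a = rowAccuracies(withDefault);
        double[] b = rowAccuracies(withoutDefault);
        System.out.printf(Locale.ROOT, "with DEFAULT:    mean=%.4f stddev=%.4f%n",
                mean(a), sampleStdDev(a));
        System.out.printf(Locale.ROOT, "without DEFAULT: mean=%.4f stddev=%.4f%n",
                mean(b), sampleStdDev(b));
    }
}
```

With the empty "DEFAULT" row included, the mean comes out at about 0.6217 and the standard deviation at about 0.5384, which lines up with the reported Reliability of 62,1657% and 0,5384. Without it, the mean is about 0.9325, the same row average computed in the question.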