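I have been using Weka's J48 decision tree to classify keyword frequencies in RSS feeds into target categories. I think I may have a problem reconciling the generated decision tree with the reported number of correctly classified instances and with the confusion matrix.
For example, one of my .arff files contains the following data extracts: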
@attribute Keyword_1_nasa_Frequency numeric
@attribute Keyword_2_fish_Frequency numeric
@attribute Keyword_3_kill_Frequency numeric
@attribute Keyword_4_show_Frequency numeric
...
@attribute Keyword_64_fear_Frequency numeric
@attribute RSSFeedCategoryDescription {BFE,FCL,F,M, NCA, SNT,S}
@data
0,0,0,34,0,0,0,0,0,40,0,0,0,0,0,0,0,0,0,0,24,0,0,0,0,13,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,BFE
0,0,0,10,0,0,0,0,0,11,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,BFE
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,BFE
...
20,0,64,19,0,162,0,0,36,72,179,24,24,47,24,40,0,48,0,0,0,97,24,0,48,205,143,62,78,
0,0,216,0,36,24,24,0,0,24,0,0,0,0,140,24,0,0,0,0,72,176,0,0,144,48,0,38,0,284,
221,72,0,72,0,SNT
...
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,8,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,S
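And so on: there are 64 keywords (columns) and 570 rows in total, where each row contains the keyword frequencies in one feed for one day. In this case, there are 57 feeds over 10 days, giving a total of 570 records. Each keyword is prefixed with a surrogate number and suffixed with "Frequency".
I am running the decision tree with the default parameters and 10-fold cross-validation.
Weka reports the following: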
Correctly Classified Instances 210 36.8421 %
Incorrectly Classified Instances 360 63.1579 %
with the following confusion matrix:
=== Confusion Matrix ===
a b c d e f g <-- classified as
11 0 0 0 39 0 0 | a = BFE
0 0 0 0 60 0 0 | b = FCL
1 0 5 0 72 0 2 | c = F
0 0 1 0 69 0 0 | d = M
3 0 0 0 153 0 4 | e = NCA
0 0 0 0 90 10 0 | f = SNT
0 0 0 0 19 0 31 | g = S
and the following tree:
Keyword_22_health_Frequency <= 0
| Keyword_7_open_Frequency <= 0
| | Keyword_52_libya_Frequency <= 0
| | | Keyword_21_job_Frequency <= 0
| | | | Keyword_48_pic_Frequency <= 0
| | | | | Keyword_63_world_Frequency <= 0
| | | | | | Keyword_26_day_Frequency <= 0: NCA (461.0/343.0)
| | | | | | Keyword_26_day_Frequency > 0: BFE (8.0/3.0)
| | | | | Keyword_63_world_Frequency > 0
| | | | | | Keyword_31_gaddafi_Frequency <= 0: S (4.0/1.0)
| | | | | | Keyword_31_gaddafi_Frequency > 0: NCA (3.0)
| | | | Keyword_48_pic_Frequency > 0: F (7.0)
| | | Keyword_21_job_Frequency > 0: BFE (10.0/1.0)
| | Keyword_52_libya_Frequency > 0: NCA (31.0)
| Keyword_7_open_Frequency > 0
| | Keyword_31_gaddafi_Frequency <= 0: S (32.0/1.0)
| | Keyword_31_gaddafi_Frequency > 0: NCA (4.0)
Keyword_22_health_Frequency > 0: SNT (10.0)
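My question concerns reconciling the matrix with the tree, or vice versa. As far as I understand the results, a rating like (461.0/343.0) indicates that 461 instances have been classified as NCA. But how can that be when the matrix shows only 153? I am not sure how to interpret this, so any help is welcome.
Thanks in advance.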
Answer 0 (score: 2)
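The numbers in parentheses at each leaf should be read as (total number of instances given this classification at this leaf / number of those that are incorrectly classified at this leaf). A leaf with a single number, such as NCA (31.0), has no incorrect classifications.
In your example, the first NCA leaf says that 461 instances were classified as NCA there, and that 343 of those 461 are incorrect classifications. So there are 461 - 343 = 118 correctly classified instances at that leaf.
Looking through your decision tree, note that NCA also appears at other leaves. I count 118 + 3 + 31 + 4 = 156 correctly classified instances out of 461 + 3 + 31 + 4 = 499 total classifications as NCA.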
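That tally can be reproduced with a short Python sketch. It assumes the leaf lines are pasted in by hand from the tree in the question (the branch-only lines are left out), and the regular expression and the `summarize_leaves` helper are illustrative names of my own, not part of Weka.

```python
import re
from collections import defaultdict

# Leaf lines copied from the J48 tree in the question.
LEAVES = """
Keyword_26_day_Frequency <= 0: NCA (461.0/343.0)
Keyword_26_day_Frequency > 0: BFE (8.0/3.0)
Keyword_31_gaddafi_Frequency <= 0: S (4.0/1.0)
Keyword_31_gaddafi_Frequency > 0: NCA (3.0)
Keyword_48_pic_Frequency > 0: F (7.0)
Keyword_21_job_Frequency > 0: BFE (10.0/1.0)
Keyword_52_libya_Frequency > 0: NCA (31.0)
Keyword_31_gaddafi_Frequency <= 0: S (32.0/1.0)
Keyword_31_gaddafi_Frequency > 0: NCA (4.0)
Keyword_22_health_Frequency > 0: SNT (10.0)
"""

# Matches ": LABEL (total)" or ": LABEL (total/incorrect)" at the end of a line.
LEAF_RE = re.compile(r":\s*(\w+)\s*\((\d+(?:\.\d+)?)(?:/(\d+(?:\.\d+)?))?\)\s*$")


def summarize_leaves(text):
    """Return {label: (total, incorrect)} summed over all leaves with that label."""
    sums = defaultdict(lambda: [0.0, 0.0])
    for line in text.splitlines():
        match = LEAF_RE.search(line)
        if match is None:
            continue  # blank line or a branch condition without a leaf
        label = match.group(1)
        sums[label][0] += float(match.group(2))
        # A leaf with no "/incorrect" part has zero incorrect classifications.
        sums[label][1] += float(match.group(3) or 0)
    return {label: (total, wrong) for label, (total, wrong) in sums.items()}


summary = summarize_leaves(LEAVES)
nca_total, nca_wrong = summary["NCA"]
print(nca_total, nca_total - nca_wrong)  # 499.0 156.0
```

Summing the totals over every leaf gives 570, so the counts printed on the tree account for every record in the data set.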
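Your confusion matrix shows 153 correct NCA classifications out of 39 + 60 + 72 + 69 + 153 + 90 + 19 = 502 total classifications as NCA (the sum of column e).
So there is a slight difference between the tree (156/499) and your confusion matrix (153/502).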
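The matrix side of the comparison can be checked the same way. This is a minimal sketch with the confusion matrix from the question typed in by hand; `predicted_as` is a hypothetical helper name, not a Weka function.

```python
# Rows are actual classes, columns are predicted classes, in the order Weka printed them.
LABELS = ["BFE", "FCL", "F", "M", "NCA", "SNT", "S"]
MATRIX = [
    [11, 0, 0, 0, 39, 0, 0],
    [0, 0, 0, 0, 60, 0, 0],
    [1, 0, 5, 0, 72, 0, 2],
    [0, 0, 1, 0, 69, 0, 0],
    [3, 0, 0, 0, 153, 0, 4],
    [0, 0, 0, 0, 90, 10, 0],
    [0, 0, 0, 0, 19, 0, 31],
]


def predicted_as(matrix, labels, label):
    """Return (correct, total) for all instances predicted as `label`."""
    j = labels.index(label)
    total = sum(row[j] for row in matrix)  # column sum: everything predicted as label
    correct = matrix[j][j]                 # diagonal cell: predicted and actual agree
    return correct, total


nca_correct, nca_predicted = predicted_as(MATRIX, LABELS, "NCA")
overall_correct = sum(MATRIX[i][i] for i in range(len(LABELS)))
overall_total = sum(sum(row) for row in MATRIX)

print(nca_correct, nca_predicted)      # 153 502
print(overall_correct, overall_total)  # 210 570
```

The diagonal sums to 210 out of 570, which matches the "Correctly Classified Instances" line in the question, so the matrix and the summary statistics come from the same evaluation run.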
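Note that if you run Weka from the command line, it outputs a tree and a confusion matrix for testing on all of the training data as well as for testing with cross-validation, which gives you two sets of results. Make sure you are looking at the right matrix for the right tree; you may have mixed them up. The printed tree and its leaf counts describe the model built on the full data set, whereas the cross-validation matrix comes from models trained on subsets of it, so the two are not expected to agree exactly.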