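In a data mining assignment, I was asked: "Apply k-means clustering to the original data set. Set the number of desired clusters to the known number of classes. Output the cluster for each instance. Discuss the differences between this classification and the available ground truth."
The output from clustering the data set with the known number of attributes is shown below: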
kMeans
======
Number of iterations: 8
Within cluster sum of squared errors: 62.4309244109214
Initial starting points (random):
Cluster 0: 4,31,2,1,3
Cluster 1: 5,52,4,3,3
Cluster 2: 5,33,2,4,3
Cluster 3: 3,65,4,5,3
Cluster 4: 4,56,1,1,3
Cluster 5: 5,60,4,4,3
Missing values globally replaced with mean/mode
Final cluster centroids:
                      Cluster#
Attribute    Full Data         0         1         2         3         4         5
               (961.0)   (173.0)   (143.0)   (110.0)   (126.0)   (186.0)   (223.0)
==================================================================================
BI-RADS         4.3483    3.9595    4.8486    4.2364    4.7222    3.9247    4.5262
Age            55.4874   48.0867   60.3984   56.3364   61.8372   47.4462   60.7802
Shape           2.7215    2.1313    3.6371    1.7267    3.8858         1     3.861
Margin          2.7963    1.0289    2.7784    3.6537         5    1.0108         4
Density         2.9107    2.8457    2.9117    2.8929    2.9661    2.9004    2.9467
Time taken to build model (full training data) : 0 seconds
=== Model and evaluation on training set ===
Clustered Instances
0 173 ( 18%)
1 143 ( 15%)
2 110 ( 11%)
3 126 ( 13%)
4 186 ( 19%)
5 223 ( 23%)
Class attribute: class
Classes to Clusters:
   0   1   2   3   4   5  <-- assigned to cluster
 154  50  70  20 164  58 | 0
  19  93  40 106  22 165 | 1
Cluster 0 <-- No class
Cluster 1 <-- No class
Cluster 2 <-- No class
Cluster 3 <-- No class
Cluster 4 <-- 0
Cluster 5 <-- 1
Incorrectly clustered instances : 632.0 65.7648 %
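The last line of the output can be checked against the Classes to Clusters matrix. Here is a minimal Python sketch, assuming (this is a guess at the rule, not something taken from the Weka documentation) that each class is mapped to its own distinct cluster so that as many instances as possible match. The names `counts` and `best_mapping` are made up for illustration.

```python
from itertools import permutations

# Classes-to-clusters counts copied from the output above:
# rows are classes 0 and 1, columns are clusters 0..5.
counts = [
    [154, 50, 70, 20, 164, 58],   # class 0
    [19, 93, 40, 106, 22, 165],   # class 1
]


def best_mapping(counts):
    """Assumed rule: pick one distinct cluster per class, maximizing matches."""
    n_classes = len(counts)
    n_clusters = len(counts[0])
    best_hits, best_clusters = -1, None
    # Try every way of giving each class a different cluster.
    for clusters in permutations(range(n_clusters), n_classes):
        hits = sum(counts[c][k] for c, k in enumerate(clusters))
        if hits > best_hits:
            best_hits, best_clusters = hits, clusters
    return best_clusters, best_hits


total = sum(sum(row) for row in counts)
clusters, hits = best_mapping(counts)
incorrect = total - hits
pct = 100.0 * incorrect / total

print(clusters)  # (4, 5): class 0 -> cluster 4, class 1 -> cluster 5
print(f"{incorrect} of {total} incorrectly clustered ({pct:.4f} %)")
```

Under that assumption the sketch gives 632 of 961 instances (65.7648 %), which agrees with the figure reported above. Every instance that falls in clusters 0 to 3 counts as an error, because those clusters are mapped to no class.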
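I'm not sure what "ground truth" means here, or what comments I could make about the significant differences between this classification and the available ground truth.
Any input is appreciated.
Thanks.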