Question

I'm new to Mahout. I'm using Mahout 0.8 and following the tutorial at https://cwiki.apache.org/MAHOUT/clustering-of-synthetic-control-data.html

When I run: mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job -i testdata -o output -t1 20 -t2 50 -k 5 -x 20 -ow

and then use clusterdump to extract the cluster centers:

mahout clusterdump --input output/clusters-20-final --output /media/synthetic_control.center

the synthetic_control.center file contains:

VL-585{n=50 c=[29.832, 29.589, 29.405, 28.516, 29.600, ….] r=[3.152, 3.518, 3.292, …]}

VL-591{n=197 c=[29.984, 29.681,…] r=[3.602, 3.558, 3.364,…]}

VL-595{n=203 c=[….] r=[….]}

VL-597{n=61 c=[….] r=[….]}

VL-599{n=43 c=[….] r=[….]}

VL-585{n=1 c=[….] r=[….]}

VL-591{n=27 c=[….] r=[….]}

VL-595{n=1 c=[….] r=[….]}

VL-597{n=1 c=[….] r=[….]}

VL-599{n=16 c=[….] r=[….]}

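It looks like k-means produced 10 clusters, but I set k to 5.

I also tried other values of k, and it always produces twice as many clusters.

Can anyone help me? Thanks a lot!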

Answer 1

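Haha! Finally, after reading the code, I found the bug in org.apache.mahout.clustering.syntheticcontrol.kmeans.Job!

Here's the thing: in syntheticcontrol.kmeans.Job, if the user sets k, the job does not run canopy clustering before k-means; it runs k-means directly. K-means needs an initial center for each cluster, so the job uses RandomSeedGenerator to generate the centers at random and writes that file (part-randomSeed) to the output/clusters-0 folder. K-means then first classifies all points against these centers, updates the cluster centers, and writes the updated centers to output/clusters-0 as well. So the clusters-0 folder now holds two sets of centers, and the first iteration reads twice as many clusters as intended. That's why this job always produces double the number of clusters!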

Solution: save part-randomSeed to a different folder. In org.apache.mahout.clustering.syntheticcontrol.kmeans.Job,

line 142, Path clusters = new Path(output, Cluster.INITIAL_CLUSTERS_DIR);

change it to Path clusters = new Path(output, "randomSeeds");
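To make the folder collision described above easier to see, here is a small stand-alone sketch. It does not use Mahout or Hadoop at all: it is a toy model built on java.nio.file, and the class name SeedDirDemo, its helper methods, and the file name part-00000 are made up for illustration. It only assumes, as explained above, that the first iteration reads every part file it finds in clusters-0.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Toy model only: hypothetical names, plain local files instead of Mahout/Hadoop.
public class SeedDirDemo {

    // Write k placeholder "cluster" lines into dir/fileName.
    static void writeClusters(Path dir, String fileName, int k) throws IOException {
        Files.createDirectories(dir);
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < k; i++) {
            lines.add("cluster-" + i);
        }
        Files.write(dir.resolve(fileName), lines);
    }

    // Count the cluster lines in every part-* file of dir,
    // the way the first iteration is assumed to read its input folder.
    static int countClusters(Path dir) throws IOException {
        int count = 0;
        try (DirectoryStream<Path> parts = Files.newDirectoryStream(dir, "part-*")) {
            for (Path part : parts) {
                count += Files.readAllLines(part).size();
            }
        }
        return count;
    }

    // Seeds go to output/seedDirName, the first set of updated centers goes to
    // output/clusters-0, and the first iteration then reads output/clusters-0.
    static int clustersSeenByFirstIteration(String seedDirName, int k) {
        try {
            Path output = Files.createTempDirectory("kmeans-output");
            writeClusters(output.resolve(seedDirName), "part-randomSeed", k);
            writeClusters(output.resolve("clusters-0"), "part-00000", k);
            return countClusters(output.resolve("clusters-0"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Seeds share clusters-0 with the updated centers: both sets get read.
        System.out.println(clustersSeenByFirstIteration("clusters-0", 5));
        // Seeds kept in their own folder: only the k updated centers get read.
        System.out.println(clustersSeenByFirstIteration("randomSeeds", 5));
    }
}
```

With k = 5, the shared-folder case counts 10 entries and the separate-folder case counts 5, which matches the doubled output in the question and the effect of the one-line change.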

Mahout KMeans generates twice as many clusters as my initial K setting
