Mahout Canopy clustering example

Time: 2014-08-21 14:24:14

Tags: mahout

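I am trying to work through Canopy clustering on the synthetic control data, following the example in the Apache Mahout Cookbook. However, instead of 6 clusters I get 600, one for every sample in the set.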

C-0 {n = 1 c = [0:28.781,1:34.463,2:31.338,3:31.283,4:28.921,5:33.760,6:25.397,7:27.785,8:35.248,9:27.116,10:32.872,11:29.217,12:36.025,13:32.337,15:34.525,16:32.872,17:34.117,18:26.524,19:27.662,20:26.369,21:25.774,22:29.270,25:30.733,26:29.505,27:33.029,28:25.040,31:28.917,32:24.344,33:26.120,34:34.942,35:25.029,36:26.631,37:35.654,38:28.435,39:29.150,40:28.158,41:26.193,42:33.318,43:30.977,44:27.044,45:35.534,46:26.235,47:28.996,48:32.004,49:31.056,50:34.255,51:28.072,52:28.940,53:35.497,54:29.747,56:31.433,57:24.556,58:33.743,59:25.047,60:34.932] r = []}

C-1 {n = 1 c = [0:24.892,1:25.741,3:27.553,4:32.822,5:27.879,6:31.593,7:31.486,8:35.547,9:27.952,10:31.660,11:27.542,12:31.189,13:27.487,14:31.391,16:27.811,18:24.488,20:27.592,21:35.627,22:35.410,23:31.417,24:30.745,25:24.131,26:35.142,27:30.472,28:31.987,29:33.662,30:25.551,31:30.469,32:33.647,33:25.070,34:34.077,35:32.598,36:28.304,37:26.147,38:26.941,39:31.520,40:33.109,41:24.149,42:28.516,43:25.791,44:35.952,45:26.530,46:24.858,47:25.956,48:32.836,49:28.532,50:26.346,51:30.621,52:28.986,53:29.405,54:32.558,55:31.021,56:26.642,57:28.433,58:33.656,59:26.424,60:28.466] r = []}

C-2 {n = 1 c = [0:31.399,1:30.632,2:26.398,3:24.291,4:27.861,5:28.549,6:24.972,7:32.436,8:25.224,9:27.307,10:31.839,11:27.259,12:28.257,13:26.582,14:24.046,15:35.063,16:31.572,17:32.561,18:31.031,19:34.120,20:26.934,21:31.478,22:35.017,23:32.385,24:24.332,25:30.200,26:31.245,27:26.681,28:31.514,29:28.878,30:27.309,31:24.246,33:26.963,34:25.292,35:31.611,36:24.713,37:27.481,38:24.208,39:26.806,40:35.125,41:32.629,42:31.056,43:26.358,44:28.086,45:31.439,46:27.306,47:29.608,48:35.973,49:34.144,50:27.172,51:33.632,52:26.597,53:25.539,54:32.543,55:25.577,56:29.990,57:31.351,59:33.900,60:29.545] r = []}

C-3 {n = 1 c = [0:25.774,2:30.526,3:35.421,4:25.603,5:27.970,8:25.270,9:28.132,11:29.427,12:31.455,13:27.320,16:28.956,17:28.992,18:29.958,19:30.277,20:30.445,21:24.304,22:24.314,24:35.097,25:25.368,26:32.097,27:33.330,28:25.010,29:35.316,30:31.626,31:29.281,32:34.202,33:26.508,34:32.228,35:25.527,36:24.824,38:27.559,39:28.371,40:32.367,41:26.975,42:35.935,43:35.115,44:24.375,45:27.608,46:27.843,47:29.856,48:32.419,49:26.891,50:31.321,51:29.385,52:34.334,53:24.738,54:35.769,56:31.873,57:34.205,58:31.156,60:34.629] r = []}

And so on, up to C-600.

Can anyone think of a reason for this?

I am using

mahout canopy -i $WORK_DIR/sequencefile/synthetic_control.seq \
    -o $WORK_DIR/output/canopy.output -t1 80 -t2 55

I am running Mahout 0.9 on Hadoop 1.2.1. The book's examples target Mahout 0.9; has the way the command is invoked changed?

I even tried different values for t1 and t2, but the result is the same.

Thanks

1 answer:

Answer 0: (score: 0)

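Canopy WAS used to produce a guess for the parameter "K" in k-means. It is very sensitive to the choice of t1 and t2, which IMO makes it useless. That is why it has been deprecated.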
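To illustrate that sensitivity, here is a minimal sketch of the textbook canopy pass in plain Python. It is not Mahout's implementation; the function name `canopy_pass`, the Euclidean distance, and the toy points are all assumptions made for the illustration. The point is only that the same data yields one canopy, a sensible number, or one canopy per point depending on how t1 and t2 compare with the typical distances between points.

```python
import math

def euclidean(a, b):
    """Plain Euclidean distance (an assumption; other measures can be plugged in)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def canopy_pass(points, t1, t2, distance=euclidean):
    """Textbook single-pass canopy clustering (illustrative, not Mahout's code).

    Returns a list of (center, members) pairs. t1 is the loose threshold,
    t2 the tight one, and t1 must be greater than t2.
    """
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)          # next unclaimed point starts a canopy
        members = [center]
        still_remaining = []
        for p in remaining:
            d = distance(center, p)
            if d < t1:
                members.append(p)          # loosely belongs to this canopy
            if d >= t2:
                still_remaining.append(p)  # not tightly bound, may seed a later canopy
        remaining = still_remaining
        canopies.append((center, members))
    return canopies

# Hypothetical toy data: two tight groups far apart.
POINTS = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5),
          (10.0, 10.0), (10.5, 10.0), (10.0, 10.5)]

for t1, t2 in [(30.0, 20.0), (3.0, 2.0), (0.2, 0.1)]:
    print(t1, t2, len(canopy_pass(POINTS, t1, t2)))
```

When t2 is smaller than the distance between any two points, no point is ever removed by another point's canopy, so every point ends up as its own canopy center. That is the same symptom as one cluster per sample, although the sketch cannot say whether that is what happened in the question.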

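There is no good alternative in Mahout, but you could look at streaming k-means, or run clusterdump on the k-means results and find the best k for your real data by looking for the highest cohesion and the greatest separation.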
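As a rough sketch of what "highest cohesion and greatest separation" can mean in practice, the plain Python below runs a tiny k-means for several candidate values of k and scores each one. Everything here is an assumption for illustration, not Mahout code: cohesion is taken as the mean distance from a point to its cluster center (smaller is tighter), separation as the smallest distance between two centers, and the separation/cohesion ratio is just one simple way to combine them. The farthest-first seeding is there only to keep the sketch deterministic.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(vectors):
    return tuple(sum(col) / len(vectors) for col in zip(*vectors))

def initial_centers(points, k):
    # Farthest-first seeding: deterministic, chosen for the sketch only.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    return centers

def assign(points, centers):
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
        clusters[nearest].append(p)
    return clusters

def kmeans(points, k, iterations=20):
    centers = initial_centers(points, k)
    for _ in range(iterations):
        clusters = assign(points, centers)
        # An empty cluster keeps its previous center.
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, assign(points, centers)

def cohesion(centers, clusters):
    # Mean distance from each point to its own center (lower means tighter).
    total = sum(dist(p, c) for c, members in zip(centers, clusters) for p in members)
    return total / sum(len(members) for members in clusters)

def separation(centers):
    # Smallest distance between two distinct centers (needs k >= 2).
    return min(dist(a, b) for i, a in enumerate(centers) for b in centers[i + 1:])

def best_k(points, candidates):
    # Assumes every candidate k is >= 2 and well below the number of distinct points.
    scores = {}
    for k in candidates:
        centers, clusters = kmeans(points, k)
        scores[k] = separation(centers) / cohesion(centers, clusters)
    return max(scores, key=scores.get), scores

# Hypothetical toy data: three well-separated groups.
POINTS = [(0, 0), (1, 0), (0, 1),
          (10, 0), (11, 0), (10, 1),
          (0, 10), (1, 10), (0, 11)]

k, scores = best_k(POINTS, range(2, 5))
print(k, scores)
```

On real data the scores will rarely separate this cleanly, so treat the ratio as a starting point for comparison rather than a definitive criterion.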