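I am trying to use Mahout KMeans in a simple application. I build a set of vectors by hand from database content, and I just want to feed those vectors to Mahout (0.9), for example to KMeansClusterer, and use the output.
I have read Mahout in Action (whose examples target version 0.5) and many online forums for background. But I can no longer find a way to use Mahout KMeans (or the related clusterers) without going through Hadoop with file names and file paths. The documentation is sketchy. Can Mahout still be used this way? Is there a current example of using Mahout KMeans from code (not from the command line)?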
    private List<Cluster> kMeans(List<Vector> allvectors, double closeness, int numclusters, int iterations) {
        List<Cluster> clusters = new ArrayList<Cluster>();
        int clusterId = 0;
        for (Vector v : allvectors) {
            clusters.add(new Kluster(v, clusterId++, new EuclideanDistanceMeasure()));
        }
        List<List<Cluster>> finalclusters = KMeansClusterer.clusterPoints(allvectors, clusters, 0.01, numclusters, 10);
        for (Cluster cluster : finalclusters.get(finalclusters.size() - 1)) {
            System.out.println("Fuzzy Cluster id: " + cluster.getId() + " center: " + cluster.getCenter().asFormatString());
        }
        return clusters;
    }
Answer 0 (score: 2)
First, you need to write your vectors to a sequence file. Here is the code:
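    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;
    import com.google.common.io.Closeables;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.NamedVector;
    import org.apache.mahout.math.VectorWritable;

    // userName (the vector's name) and writeFile (the output file path) are Strings you supply
    List<VectorWritable> vectors = new ArrayList<>();
    double[] vectorValues = { /* your vector values */ };
    vectors.add(new VectorWritable(new NamedVector(new DenseVector(vectorValues), userName)));
    Configuration conf = new Configuration();
    // resolve the file system from the output file's URI (local file system for a plain path)
    FileSystem fs = FileSystem.get(new File(writeFile).toURI(), conf);
    SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, new Path(writeFile), Text.class, VectorWritable.class);
    try {
        int i = 0;
        for (VectorWritable vw : vectors) {
            writer.append(new Text("mapred_" + i++), vw);
        }
    } finally {
        Closeables.close(writer, false);
    }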
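Then use the lines below to generate the clusters. KMeans needs initial clusters, so I use Canopy to generate them.
However, you will not be able to read the clustering output directly, because it is written in sequence file format. To finally read and make sense of your clusters, you need to run the ClusterDumper class from Mahout-Integration.jar.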
    Configuration conf = new Configuration();
    // Canopy pass with t1 = 3.1 and t2 = 2.1; its output seeds KMeans
    CanopyDriver.run(conf, new Path(inputPath), new Path(canopyOutputPath),
        new ManhattanDistanceMeasure(), 3.1, 2.1, true, 0.5, true);
    // now run the KMeansDriver job, starting from the canopy clusters
    KMeansDriver.run(conf, new Path(inputPath), new Path(canopyOutputPath + "/clusters-0-final/"),
        new Path(kmeansOutput), new EuclideanDistanceMeasure(), 0.001, 10, true, 2d, false);
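If the data set is small enough to sit in memory and the file round trip is not worth it, the same two-stage idea (cheap seeding, then k-means refinement) can be sketched in plain Java. To be clear, this is not Mahout code and does not call any Mahout API: the class `InMemoryKMeans` and its methods `canopyCenters`, `kMeans`, and `nearest` are hypothetical names made up for illustration, the seeding step is a simplified canopy pass that uses only a single T2-style threshold, and the refinement is ordinary Lloyd-style k-means with Euclidean distance throughout.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: a plain-Java stand-in, not Mahout's implementation.
class InMemoryKMeans {

    /** Euclidean distance between two points of equal dimension. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Simplified canopy seeding: take a point as a seed, drop every remaining point closer than t2, repeat. */
    static List<double[]> canopyCenters(List<double[]> points, double t2) {
        List<double[]> remaining = new ArrayList<>(points);
        List<double[]> centers = new ArrayList<>();
        while (!remaining.isEmpty()) {
            double[] center = remaining.remove(0);
            centers.add(center.clone());
            remaining.removeIf(p -> distance(p, center) < t2);
        }
        return centers;
    }

    /** Index of the center closest to p. */
    static int nearest(double[] p, List<double[]> centers) {
        int best = 0;
        double bestDistance = Double.MAX_VALUE;
        for (int c = 0; c < centers.size(); c++) {
            double d = distance(p, centers.get(c));
            if (d < bestDistance) {
                bestDistance = d;
                best = c;
            }
        }
        return best;
    }

    /** Lloyd-style k-means: stops after maxIterations or once no center moves more than convergenceDelta. */
    static List<double[]> kMeans(List<double[]> points, List<double[]> initial,
                                 int maxIterations, double convergenceDelta) {
        List<double[]> centers = new ArrayList<>();
        for (double[] c : initial) {
            centers.add(c.clone());
        }
        int dim = points.get(0).length;
        for (int iter = 0; iter < maxIterations; iter++) {
            double[][] sums = new double[centers.size()][dim];
            int[] counts = new int[centers.size()];
            // assignment step: add each point to the running sum of its nearest center
            for (double[] p : points) {
                int best = nearest(p, centers);
                counts[best]++;
                for (int d = 0; d < dim; d++) {
                    sums[best][d] += p[d];
                }
            }
            // update step: move each non-empty cluster's center to the mean of its points
            double maxShift = 0.0;
            for (int c = 0; c < centers.size(); c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its previous center
                }
                double[] updated = new double[dim];
                for (int d = 0; d < dim; d++) {
                    updated[d] = sums[c][d] / counts[c];
                }
                maxShift = Math.max(maxShift, distance(updated, centers.get(c)));
                centers.set(c, updated);
            }
            if (maxShift <= convergenceDelta) {
                break;
            }
        }
        return centers;
    }

    public static void main(String[] args) {
        // made-up sample data: two well-separated groups of 2-D points
        List<double[]> points = Arrays.asList(
            new double[] {1.0, 1.0}, new double[] {1.0, 2.0}, new double[] {2.0, 1.0},
            new double[] {8.0, 8.0}, new double[] {9.0, 8.0}, new double[] {8.0, 9.0});
        List<double[]> seeds = canopyCenters(points, 2.1);
        List<double[]> centers = kMeans(points, seeds, 10, 0.001);
        for (double[] c : centers) {
            System.out.println(Arrays.toString(c));
        }
    }
}
```

The thresholds passed in `main` (2.1, 10 iterations, 0.001) simply mirror the values used in the driver calls above. The six sample points are invented, and the sketch makes no attempt to reproduce Mahout's exact canopy or convergence behaviour.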