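Hello, I am trying to run the Mahout example from Chapter 7 (k-Means Clustering). Can someone guide me on how to run that example on a Hadoop cluster (single-node CDH-4.2.1) with Mahout 0.7?
These are the steps I followed:
Copied the code (from GitHub) into the Eclipse IDE on my local machine.
Added these jars to my Eclipse project: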
hadoop-common-2.0.0-cdh4.2.1.jar
hadoop-hdfs-2.0.0-cdh4.2.1.jar
hadoop-mapreduce-client-core-2.0.0-cdh4.2.1.jar
mahout-core-0.7-cdh4.3.0.jar
mahout-core-0.7-cdh4.3.0-job.jar
mahout-math-0.7-cdh4.3.0.jar
Built a jar of this project and copied that jar to my Hadoop cluster.
Executed this command:
user@INFPH01463U:~$ hadoop jar /home/user/apurv/Kmean.jar tryout.SimpleKMeansClustering
It gave me the following error:
Exception in thread "main" java.lang.NoClassDefFoundError: FileSystem
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2427)
at java.lang.Class.getMethod0(Class.java:2670)
at java.lang.Class.getMethod(Class.java:1603)
at org.apache.hadoop.util.RunJar.main(RunJar.java:202)
Caused by: java.lang.ClassNotFoundException: FileSystem
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
... 5 more
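Can anyone help me figure out what I am missing, or whether I am running it the wrong way?
Second, I would like to know how to run k-means clustering on a CSV file.
Thanks in advance :)
Answer 0 (score: 0)
The given code is misleading. This code: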
Cluster cluster = new Cluster(vec, i, new EuclideanDistanceMeasure());
writer.append(new Text(cluster.getIdentifier()), cluster);
}
writer.close();
KMeansDriver.run(conf, new Path("testdata/points"), new Path("testdata/clusters"),
new Path("output"), new EuclideanDistanceMeasure(), 0.001, 10,
true, false);
SequenceFile.Reader reader = new SequenceFile.Reader(fs,
new Path("output/" + Cluster.CLUSTERED_POINTS_DIR
+ "/part-m-00000"), conf);
should be replaced with:
Kluster cluster = new Kluster(vec, i, new EuclideanDistanceMeasure());
writer.append(new Text(cluster.getIdentifier()), cluster);
}
writer.close();
KMeansDriver.run(conf, new Path("testdata/points"), new Path("testdata/clusters"),
new Path("output"), new EuclideanDistanceMeasure(), 0.001, 10,
true, false);
SequenceFile.Reader reader = new SequenceFile.Reader(fs,
new Path("output/" + Kluster.CLUSTERED_POINTS_DIR
+ "/part-m-00000"), conf);
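Cluster is an interface, whereas Kluster is a class. For details, see the Mahout API Javadoc.
To run k-means on a CSV file, you first have to create a SequenceFile to pass as an argument to KMeansDriver. The following code reads each line of the CSV file "points.csv", converts it to a vector, and writes it to the SequenceFile "points.seq":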
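As for the NoClassDefFoundError in the question, one way to narrow it down is to check which classes are actually visible at run time. The following is only a minimal sketch: the class name ClasspathCheck is made up for illustration, and it assumes that the "FileSystem" in the stack trace refers to org.apache.hadoop.fs.FileSystem, which the trace itself does not confirm. It uses only the standard Class.forName call.

```java
// Hypothetical helper, not part of Hadoop or Mahout.
final class ClasspathCheck {

    /** Returns true if the named class can be located by this class's class loader. */
    static boolean isLoadable(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumption: these are the classes worth checking; pass other names as arguments.
        String[] names = args.length > 0 ? args : new String[] {
            "org.apache.hadoop.fs.FileSystem",
            "org.apache.mahout.clustering.kmeans.Kluster"
        };
        for (String name : names) {
            System.out.println(name + " -> " + (isLoadable(name) ? "found" : "MISSING"));
        }
    }
}
```

If this class is packaged into the same jar and started the same way as the failing job, a "MISSING" line would suggest that the corresponding jar is not on the runtime classpath, even though it was on the Eclipse build path.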
try (
BufferedReader reader = new BufferedReader(new FileReader("testdata2/points.csv"));
SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf,new Path("testdata2/points.seq"), LongWritable.class, VectorWritable.class)
) {
String line;
long counter = 0;
while ((line = reader.readLine()) != null) {
String[] c = line.split(",");
if(c.length>1){
double[] d = new double[c.length];
for (int i = 0; i < c.length; i++)
d[i] = Double.parseDouble(c[i]);
Vector vec = new RandomAccessSparseVector(c.length);
vec.assign(d);
VectorWritable writable = new VectorWritable();
writable.set(vec);
writer.append(new LongWritable(counter++), writable);
}
}
// No explicit close needed: try-with-resources closes reader and writer.
}
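The parsing step above throws a NumberFormatException on a header row or a malformed cell. As a rough sketch of one way to make it more tolerant, the helper below skips such lines instead of failing. The class and method names (CsvRows, parseLine, parseAll) are hypothetical, it uses only the JDK, and it assumes a plain comma-separated file with no quoted fields. Unlike the code above, it also keeps single-column rows.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Mahout.
final class CsvRows {

    /** Parses one comma-separated line of numbers; returns null for blank or non-numeric lines. */
    static double[] parseLine(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return null; // blank line
        }
        String[] cells = trimmed.split(",");
        double[] values = new double[cells.length];
        try {
            for (int i = 0; i < cells.length; i++) {
                values[i] = Double.parseDouble(cells[i].trim());
            }
        } catch (NumberFormatException e) {
            return null; // header row or malformed cell
        }
        return values;
    }

    /** Parses all lines and keeps only the rows that were fully numeric. */
    static List<double[]> parseAll(Iterable<String> lines) {
        List<double[]> rows = new ArrayList<double[]>();
        for (String line : lines) {
            double[] row = parseLine(line);
            if (row != null) {
                rows.add(row);
            }
        }
        return rows;
    }
}
```

Each double[] returned by parseAll could then be assigned to a vector and appended to the SequenceFile in the same way as in the loop above.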
Hope it helps!!