I am trying to cluster data in Mahout, but the job fails with an error. Here is the error:
java.lang.ArrayIndexOutOfBoundsException: 0
at org.apache.mahout.clustering.classify.ClusterClassificationMapper.populateClusterModels(ClusterClassificationMapper.java:129)
at org.apache.mahout.clustering.classify.ClusterClassificationMapper.setup(ClusterClassificationMapper.java:74)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
13/03/07 19:29:31 INFO mapred.JobClient: map 0% reduce 0%
13/03/07 19:29:31 INFO mapred.JobClient: Job complete: job_local_0010
13/03/07 19:29:31 INFO mapred.JobClient: Counters: 0
java.lang.InterruptedException: Cluster Classification Driver Job failed processing E:/Thesis/Experiments/Mahout dataset/input
at org.apache.mahout.clustering.classify.ClusterClassificationDriver.classifyClusterMR(ClusterClassificationDriver.java:276)
at org.apache.mahout.clustering.classify.ClusterClassificationDriver.run(ClusterClassificationDriver.java:135)
at org.apache.mahout.clustering.kmeans.KMeansDriver.clusterData(KMeansDriver.java:260)
at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:152)
at com.ifm.dataclustering.SequencePrep.<init>(SequencePrep.java:95)
at com.ifm.dataclustering.App.main(App.java:8)
Here is my code:
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path vector_path = new Path("E:/Thesis/Experiments/Mahout dataset/input/vector_input");
SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, vector_path, Text.class, VectorWritable.class);
VectorWritable vec = new VectorWritable();
for (NamedVector outputVec : vector) {
vec.set(outputVec);
writer.append(new Text(outputVec.getName()), vec);
}
writer.close();
// create initial cluster
Path cluster_path = new Path("E:/Thesis/Experiments/Mahout dataset/clusters/part-00000");
SequenceFile.Writer cluster_writer = new SequenceFile.Writer(fs, conf, cluster_path, Text.class, Kluster.class);
// number of cluster k
int k=4;
for (int i = 0; i < k; i++) {
NamedVector outputVec = vector.get(i);
Kluster cluster = new Kluster(outputVec, i, new EuclideanDistanceMeasure());
// System.out.println(cluster);
cluster_writer.append(new Text(cluster.getIdentifier()), cluster);
}
cluster_writer.close();
// set cluster output path
Path output = new Path("E:/Thesis/Experiments/Mahout dataset/output");
HadoopUtil.delete(conf, output);
KMeansDriver.run(conf, new Path("E:/Thesis/Experiments/Mahout dataset/input"), new Path("E:/Thesis/Experiments/Mahout dataset/clusters"),
output, new EuclideanDistanceMeasure(), 0.001, 10,
true, 0.0, false);
SequenceFile.Reader output_reader = new SequenceFile.Reader(fs,new Path("E:/Thesis/Experiments/Mahout dataset/output/" + Kluster.CLUSTERED_POINTS_DIR+ "/part-m-00000"), conf);
IntWritable key = new IntWritable();
WeightedVectorWritable value = new WeightedVectorWritable();
while (output_reader.next(key, value)) {
System.out.println(value.toString() + " belongs to cluster "
+ key.toString());
}
output_reader.close();
}
Answer 0 (score: 3)
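The paths to the input/output data appear to be incorrect. The MapReduce job runs on the cluster, so the data is read from HDFS, not from the local hard disk.
The error message: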
java.lang.InterruptedException: Cluster Classification Driver Job failed processing E:/Thesis/Experiments/Mahout dataset/input
at org.apache.mahout.clustering.classify.ClusterClassificationDriver.classifyClusterMR(ClusterClassificationDriver.java:276)
gives you a hint about the incorrect path.
Before running the job, make sure you upload the input data to HDFS:
hadoop fs -mkdir input
hadoop fs -copyFromLocal E:\\file input
...
Then, instead of:
new Path("E:/Thesis/Experiments/Mahout dataset/input")
you should use the HDFS path:
new Path("input")
or
new Path("/user/<username>/input")
EDIT
Use FileSystem#exists(Path path) to check whether a Path is valid.
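A minimal sketch of such a check, assuming the Hadoop client libraries are on the classpath. The class name PathCheck, the helper firstMissing, and the relative paths "input" and "clusters" are illustrative placeholders, not part of the original code. Printing the file system URI also shows whether the Configuration resolves to HDFS or to the local file system, which is worth knowing before deciding which kind of path to pass to the job.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PathCheck {

    /**
     * Hypothetical helper: returns the first path that does not exist on the
     * given file system, or null if every path exists.
     */
    public static Path firstMissing(FileSystem fs, Path... paths) throws IOException {
        for (Path p : paths) {
            if (!fs.exists(p)) {
                return p;
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Shows which file system the configuration resolves to
        // (an hdfs:// URI on a configured cluster, file:/// otherwise).
        System.out.println("File system in use: " + fs.getUri());

        // "input" and "clusters" are placeholder paths; substitute the ones
        // that are passed to KMeansDriver.run(...).
        Path missing = firstMissing(fs, new Path("input"), new Path("clusters"));
        if (missing != null) {
            System.out.println("Missing path: " + fs.makeQualified(missing));
        } else {
            System.out.println("All paths exist.");
        }
    }
}
```

Running this before KMeansDriver.run(...) reports the fully qualified form of any path that cannot be found, which makes it easier to see where the job is actually looking.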