If I merge several files on different topics into a single file on HDFS using Hadoop, and then run clustering on that merged file, I would like to know which cluster each of the original (pre-merge) files ends up in after clustering with Mahout. How can I do this?

Also, does running K-means clustering depend on the number of files used to build the input sequence file? When I create the sequence file from four files, I get no answer back from K-means.
I create the merged file on HDFS with the following code, adapted from the Hadoop in Action book:
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutMerge {

    // Destination of the merged file on HDFS.
    public static String outputMerg = "hdfs/hdfs.txt";

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);
        FileSystem local = FileSystem.getLocal(conf);

        // Local directory containing the files to merge.
        Path inputDir = new Path("MergInput");
        // Merged output file on HDFS.
        Path hdfsFile = new Path(outputMerg);

        try {
            FileStatus[] inputFiles = local.listStatus(inputDir);
            FSDataOutputStream out = hdfs.create(hdfsFile);

            // Append each local file, in turn, to the single HDFS file.
            for (int i = 0; i < inputFiles.length; i++) {
                System.out.println(inputFiles[i].getPath().getName());
                FSDataInputStream in = local.open(inputFiles[i].getPath());
                byte[] buffer = new byte[256];
                int bytesRead = 0;
                while ((bytesRead = in.read(buffer)) > 0) {
                    out.write(buffer, 0, bytesRead);
                }
                in.close();
            }
            out.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
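(As an aside: depending on the Hadoop version, the same merge can be done with the built-in FileUtil.copyMerge helper instead of the hand-written loop above; a minimal sketch, assuming a Hadoop release where copyMerge is still available:)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class CopyMergeSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem local = FileSystem.getLocal(conf);
        FileSystem hdfs = FileSystem.get(conf);
        // Merge all files under the local MergInput directory into one HDFS file.
        // false = keep the source files; null = no separator string between files.
        FileUtil.copyMerge(local, new Path("MergInput"),
                hdfs, new Path("hdfs/hdfs.txt"),
                false, conf, null);
    }
}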
I then run the clustering with the following code, adapted from the Mahout in Action book:
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.clustering.kmeans.Cluster;
import org.apache.mahout.clustering.kmeans.KMeansClusterer;
import org.apache.mahout.common.distance.CosineDistanceMeasure;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class KMeansClustering {

    // Number of clusters; note that it is never assigned here, so it defaults to 0.
    public static int k;

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // TF-IDF vectors produced by the sparse-vector step.
        String vectorsFolder = SparseVectors.SparseOutput + "/tfidf-vectors";
        SequenceFile.Reader reader = new SequenceFile.Reader(fs,
                new Path(vectorsFolder + "/part-r-00000"), conf);

        // Read every (document name, vector) pair; only the vectors are kept.
        List<Vector> points = new ArrayList<Vector>();
        Text key = new Text();
        VectorWritable value = new VectorWritable();
        while (reader.next(key, value)) {
            points.add(value.get());
        }
        System.out.println(points.size());
        reader.close();

        // Pick k random points as the initial cluster centers.
        List<Vector> randomPoints = RandomPointsUtil.chooseRandomPoints(points, k);
        List<Cluster> clusters = new ArrayList<Cluster>();
        System.out.println(randomPoints.size());
        int clusterId = 0;
        for (Vector v : randomPoints) {
            clusters.add(new Cluster(v, clusterId++, new CosineDistanceMeasure()));
        }

        // Run up to 10 iterations of k-means with a convergence delta of 0.01.
        List<List<Cluster>> finalClusters = KMeansClusterer.clusterPoints(points,
                clusters, new CosineDistanceMeasure(), 10, 0.01);

        // Print the centers from the last iteration.
        for (Cluster cluster : finalClusters.get(finalClusters.size() - 1)) {
            System.out.println("Cluster id: " + cluster.getId() + " center: "
                    + cluster.getCenter().asFormatString());
        }
    }
}
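To illustrate what I am trying to get at: if each original document were kept as its own entry in the sequence file (the Text key in tfidf-vectors is the document name, which the read loop above currently discards), I imagine something like the following helper could map each document back to a cluster. This is only a rough sketch of mine, reusing the Cluster and CosineDistanceMeasure APIs from the code above; printAssignments, docNames, and docVectors are my own names:

// Rough sketch: assign each document to its nearest final cluster center.
// docNames/docVectors would be filled in the read loop above by also
// saving key.toString() next to each vector.
static void printAssignments(List<String> docNames, List<Vector> docVectors,
        List<Cluster> finalClusters) {
    CosineDistanceMeasure measure = new CosineDistanceMeasure();
    for (int i = 0; i < docVectors.size(); i++) {
        int bestId = -1;
        double bestDist = Double.MAX_VALUE;
        // Find the closest center among the clusters of the last iteration.
        for (Cluster c : finalClusters) {
            double d = measure.distance(c.getCenter(), docVectors.get(i));
            if (d < bestDist) {
                bestDist = d;
                bestId = c.getId();
            }
        }
        System.out.println(docNames.get(i) + " -> cluster " + bestId);
    }
}

Is that the right way to trace the original files back to clusters, or does merging everything into one file make this impossible?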