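I have 5 text files. I merged them into a single file, which contains about 60 sentences. I want to cluster that file into 5 clusters. I am using Weka for the clustering.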
    import java.io.File;
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.util.HashMap;
    import java.util.Scanner;

    import weka.clusterers.SimpleKMeans;
    import weka.core.Attribute;
    import weka.core.FastVector;
    import weka.core.Instances;
    import weka.core.SparseInstance;
    import weka.core.tokenizers.NGramTokenizer;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.StringToWordVector;

    public static void doClustering(String pathSentences, int numberCluster) throws IOException {
        Helper.deleteAllFileInFolder("results");
        // Number of clusters = number of sentences in the file / average number of sentences per file
        HashMap<Integer, String> sentences = new HashMap<>();
        HashMap<Integer, Integer> clustering = new HashMap<>();
        try {
            StringToWordVector filter = new StringToWordVector();
            SimpleKMeans kmeans = new SimpleKMeans();
            // A dataset with a single string attribute holding the sentence text
            FastVector atts = new FastVector(5);
            atts.addElement(new Attribute("text", (FastVector) null));
            Instances docs = new Instances("text_files", atts, 0);
            // Read one sentence per line; each line becomes one instance
            Scanner sc = new Scanner(new File(pathSentences));
            while (sc.hasNextLine()) {
                String content = sc.nextLine();
                double[] newInst = new double[1];
                newInst[0] = (double) docs.attribute(0).addStringValue(content);
                docs.add(new SparseInstance(1.0, newInst));
                sentences.put(sentences.size(), content);
                clustering.put(clustering.size(), -1);
            }
            sc.close();
            NGramTokenizer tokenizer = new NGramTokenizer();
            tokenizer.setNGramMinSize(10);
            tokenizer.setNGramMaxSize(10);
            tokenizer.setDelimiters("\\W");
            // Configure the filter first: setInputFormat() should be the last
            // call before the filter is applied
            filter.setTokenizer(tokenizer);
            filter.setLowerCaseTokens(true);
            filter.setWordsToKeep(1);
            filter.setInputFormat(docs);
            Instances filteredData = Filter.useFilter(docs, filter);
            // Keep the instance order so assignments[i] refers to sentence i
            kmeans.setPreserveInstancesOrder(true);
            kmeans.setNumClusters(numberCluster);
            kmeans.buildClusterer(filteredData);
            int[] assignments = kmeans.getAssignments();
            int i = 0;
            for (int clusterNum : assignments) {
                clustering.put(i, clusterNum);
                i++;
            }
            // Write each cluster's sentences to its own result file
            PrintWriter[] pw = new PrintWriter[numberCluster];
            for (int j = 0; j < numberCluster; j++) {
                pw[j] = new PrintWriter(new File("results/result" + j + ".txt"));
            }
            sentences.entrySet().stream().forEach((entry) -> {
                Integer key = entry.getKey();
                String value = entry.getValue();
                Integer cluster = clustering.get(key);
                pw[cluster].println(value);
            });
            for (int j = 0; j < numberCluster; j++) {
                pw[j].close();
            }
        } catch (Exception e) {
            System.out.println("Error K means " + e);
        }
    }
When I change the order of the input files, the clustering result changes as well. Can you help me fix this? Thank you very much.
Answer 0 (score: 1)
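k-means is a randomized algorithm.
It picks some instances as initial seeds and then searches for a local optimum.
Of course it will produce different results!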
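To make the seeding step concrete, here is a minimal sketch of k-means (Lloyd's algorithm) on one-dimensional data, in plain Java. It is not Weka's `SimpleKMeans`; the class `TinyKMeans` and its `cluster` method are hypothetical names used only for illustration. The seed decides which instance indices become the initial centroids, so with a fixed seed, reordering the input puts different instances at those indices, which is one plausible reason the result changes with file order.

```java
import java.util.Arrays;
import java.util.Random;

/** Illustrative 1-D k-means; a sketch, not Weka's implementation. */
final class TinyKMeans {

    /** Clusters data into k groups (requires 1 <= k <= data.length). */
    static int[] cluster(double[] data, int k, long seed) {
        Random rnd = new Random(seed);
        // Pick k distinct instances as initial centroids
        // (partial Fisher-Yates shuffle over the instance indices).
        int[] idx = new int[data.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        double[] centroids = new double[k];
        for (int c = 0; c < k; c++) {
            int j = c + rnd.nextInt(idx.length - c);
            int tmp = idx[c]; idx[c] = idx[j]; idx[j] = tmp;
            centroids[c] = data[idx[c]];
        }
        int[] assign = new int[data.length];
        Arrays.fill(assign, -1);
        boolean changed = true;
        while (changed) {
            changed = false;
            // Assignment step: each point goes to its nearest centroid.
            for (int i = 0; i < data.length; i++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (Math.abs(data[i] - centroids[c]) < Math.abs(data[i] - centroids[best])) {
                        best = c;
                    }
                }
                if (assign[i] != best) { assign[i] = best; changed = true; }
            }
            // Update step: move each centroid to the mean of its points.
            double[] sum = new double[k];
            int[] n = new int[k];
            for (int i = 0; i < data.length; i++) { sum[assign[i]] += data[i]; n[assign[i]]++; }
            for (int c = 0; c < k; c++) if (n[c] > 0) centroids[c] = sum[c] / n[c];
        }
        return assign;
    }

    public static void main(String[] args) {
        double[] data = {0, 1, 2, 100, 101, 102};
        for (long seed = 0; seed < 3; seed++) {
            System.out.println("seed " + seed + ": " + Arrays.toString(cluster(data, 2, seed)));
        }
    }
}
```

On well-separated data like this, every seed should recover the same two groups, possibly with the labels swapped. In Weka, the analogous control is, as far as I know, `SimpleKMeans.setSeed(int)`; trying several seeds and comparing the groupings is a cheap way to see how stable the clustering is.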
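If the results vary a lot, that suggests k-means is not working well on this data. If your data is suited to k-means, most runs will produce very similar results (apart from a permutation of the cluster labels).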
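Because the labels can be permuted between runs, comparing two assignment arrays element by element is misleading. The sketch below checks whether two runs produced the same grouping regardless of which number each cluster received. `PartitionCompare` and `samePartition` are hypothetical names, and the arrays stand in for what a call like `getAssignments()` would return, assuming both arrays refer to the same sentences in the same order.

```java
import java.util.HashMap;
import java.util.Map;

/** Compares two clusterings while ignoring how the clusters are numbered. */
final class PartitionCompare {

    /** Returns true if a and b group the items identically up to a relabeling. */
    static boolean samePartition(int[] a, int[] b) {
        if (a.length != b.length) return false;
        // Build a one-to-one mapping between labels in a and labels in b.
        Map<Integer, Integer> aToB = new HashMap<>();
        Map<Integer, Integer> bToA = new HashMap<>();
        for (int i = 0; i < a.length; i++) {
            Integer mappedB = aToB.putIfAbsent(a[i], b[i]);
            Integer mappedA = bToA.putIfAbsent(b[i], a[i]);
            // A label already mapped to a different partner means the groupings differ.
            if (mappedB != null && mappedB != b[i]) return false;
            if (mappedA != null && mappedA != a[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        int[] run1 = {0, 0, 1, 1, 2};
        int[] run2 = {2, 2, 0, 0, 1}; // same groups, different labels
        int[] run3 = {0, 1, 1, 0, 2}; // genuinely different grouping
        System.out.println(samePartition(run1, run2));
        System.out.println(samePartition(run1, run3));
    }
}
```

If the input order changes between runs, the assignments have to be matched back to the same sentences (for example by sentence text) before a comparison like this is meaningful.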