Clustering with the Weka API

Time: 2017-03-04 15:33:19

Tags: java weka

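I'm using Java with the Weka library, starting from some open-source code, to cluster data. It runs correctly on datasets in .arff format, but I want to use the MovieLens dataset and cluster users by their demographic information. The file is named "u.user"; you can find the file description here: http://files.grouplens.org/datasets/movielens/ml-100k-README.txt

Here is my code: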

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class Clustering {
    public static void main(String[] args) throws Exception {
        // Load the dataset
        String dataset = "C:/Users/DELL/Desktop/work/u.user";
        DataSource source = new DataSource(dataset);
        // Get the Instances object
        Instances data = source.getDataSet();
        // Create a new k-means clusterer
        SimpleKMeans model = new SimpleKMeans();
        // Number of clusters
        model.setNumClusters(4);
        // Optionally set the distance function
        //model.setDistanceFunction(new weka.core.ManhattanDistance());
        // Build the clusterer
        model.buildClusterer(data);
        System.out.println(model);
    }
}

Running it produces this error:

Exception in thread "main" java.io.IOException: File not found : C:\Users\DELL\Desktop\work\u.names
    at weka.core.converters.C45Loader.setSource(C45Loader.java:190)
    at weka.core.converters.AbstractFileLoader.setFile(AbstractFileLoader.java:90)
    at weka.core.converters.ConverterUtils$DataSource.reset(ConverterUtils.java:306)
    at weka.core.converters.ConverterUtils$DataSource.<init>(ConverterUtils.java:141)
    at Clustering.main(Clustering.java:24)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)

Process finished with exit code 1

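I'm sure it's because of the file extension, since it works when I use other files with the .arff extension. Can you help me figure out how to cluster my data?

1 answer:

Answer 0: (score: 0)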

You also need to pay attention to the file format, not just the extension. Convert the dataset to match the Weka ARFF format. If your data is in u.user, you need to change the extension to *.arff (e.g. user.arff) and change the format to:

    @RELATION user

    @ATTRIBUTE id INTEGER % this is actually useless
    @ATTRIBUTE age INTEGER
    @ATTRIBUTE gender {M,F}
    @ATTRIBUTE occupation {administrator,artist,doctor,educator,engineer,entertainment,executive,healthcare,homemaker,lawyer,librarian,marketing,none,other,programmer,retired,salesman,scientist,student,technician,writer} % from u.occupation
    @ATTRIBUTE zipcode STRING

    @DATA
    1,24,M,technician,85711
    2,53,F,other,94043
    3,23,M,writer,32067
    4,24,M,technician,43537
    5,33,F,other,15213
    6,42,M,executive,98101
    7,57,M,administrator,91344
    8,36,M,administrator,05201
    ...

You should then be able to parse the dataset into weka.core.Instances. Unfortunately, SimpleKMeans will reject your data:

  

weka.core.UnsupportedAttributeTypeException: weka.clusterers.SimpleKMeans: Cannot handle string attributes!
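The conversion from u.user to ARFF described above could be sketched roughly as follows. This is a minimal sketch, not part of the original answer: it assumes every line of u.user has five pipe-separated fields (for example 1|24|M|technician|85711), the class name UserToArff and its toArff method are hypothetical, and the occupation values are collected from the data itself rather than read from u.occupation.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper: rewrites pipe-delimited u.user lines as ARFF text.
public class UserToArff {

    // Assumes each line is "id|age|gender|occupation|zipcode".
    static String toArff(List<String> lines) {
        List<String[]> rows = new ArrayList<>();
        TreeSet<String> occupations = new TreeSet<>(); // sorted, de-duplicated
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] fields = line.trim().split("\\|");
            if (fields.length != 5) {
                throw new IllegalArgumentException("Expected 5 fields: " + line);
            }
            rows.add(fields);
            occupations.add(fields[3]);
        }

        StringBuilder sb = new StringBuilder();
        sb.append("@RELATION user\n\n");
        sb.append("@ATTRIBUTE id INTEGER\n");
        sb.append("@ATTRIBUTE age INTEGER\n");
        sb.append("@ATTRIBUTE gender {M,F}\n");
        sb.append("@ATTRIBUTE occupation {")
          .append(String.join(",", occupations)).append("}\n");
        sb.append("@ATTRIBUTE zipcode STRING\n\n");
        sb.append("@DATA\n");
        for (String[] fields : rows) {
            sb.append(String.join(",", fields)).append("\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: java UserToArff <u.user> <user.arff>");
            return;
        }
        List<String> lines =
            Files.readAllLines(Paths.get(args[0]), StandardCharsets.ISO_8859_1);
        Files.write(Paths.get(args[1]),
            toArff(lines).getBytes(StandardCharsets.ISO_8859_1));
    }
}
```

Pointing the DataSource in the question at the generated user.arff should then let Weka pick its ARFF loader instead of the C4.5 loader seen in the stack trace.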

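So you are left with (at least) 3 options:

  1. Vectorize or convert the features of your data into numeric values (and drop useless ones such as id)
  2. Use another clustering algorithm that can handle categorical values, e.g. weka.clusterers.HierarchicalClusterer
  3. Combine both solutions

Good luck!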
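As an illustration of option 1, the sketch below turns each user into a purely numeric vector: id and zip code are dropped, age is kept as-is, gender becomes 0/1, and occupation is one-hot encoded. It is only one possible encoding under the same assumption about the u.user layout (five pipe-separated fields); the class name UserVectorizer is hypothetical, and age is left unscaled, which you may want to normalize before running a distance-based clusterer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper: encodes u.user rows as numeric feature vectors.
public class UserVectorizer {

    // Output columns: [age, isFemale, one column per occupation (sorted)].
    static double[][] vectorize(List<String> lines) {
        List<String[]> rows = new ArrayList<>();
        TreeSet<String> seen = new TreeSet<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] fields = line.trim().split("\\|");
            if (fields.length != 5) {
                throw new IllegalArgumentException("Expected 5 fields: " + line);
            }
            rows.add(fields);
            seen.add(fields[3]);
        }
        List<String> occupations = new ArrayList<>(seen);

        double[][] vectors = new double[rows.size()][2 + occupations.size()];
        for (int i = 0; i < rows.size(); i++) {
            String[] fields = rows.get(i);
            vectors[i][0] = Double.parseDouble(fields[1]);      // age
            vectors[i][1] = "F".equals(fields[2]) ? 1.0 : 0.0;  // gender
            // One-hot occupation; id (fields[0]) and zip (fields[4]) are dropped.
            vectors[i][2 + occupations.indexOf(fields[3])] = 1.0;
        }
        return vectors;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "1|24|M|technician|85711",
            "2|53|F|other|94043",
            "3|23|M|writer|32067");
        for (double[] row : vectorize(sample)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

The resulting rows could be written out as an ARFF file with only NUMERIC attributes, which avoids the string attribute that SimpleKMeans rejects.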