编辑:代码更新,评论,效果信息
我试图编写K-means ||在Java中。 (http://vldb.org/pvldb/vol5/p622_bahmanbahmani_vldb2012.pdf) 但是,它并没有很好地运作。与标准K-means相比,运行时间增加并不令我感到惊讶。我更想知道为什么我的程序的检测率用K-means ||训练与使用标准K-means的训练相比较低。怎么可能选择集群点比偶然选择集群点更差?
更新:如果在互联网关闭时发现了一些错误,k-means ||现在表现与k-means标准一样好 - 但不会好一点。
我很确定我的代码是错误的,但经过几个小时的搜索,我不知道我在哪里犯了错误(坦率地说,我对这个很陌生)主题)。
所以我希望你能看到我做错了什么。这是我的播种选项的代码:
public void training(int stop, int numberIt, double epsilon, boolean advanced){
double d=Double.MAX_VALUE,s=0;
int nearestprototype=0;
int [] myprototype=new int[trainingsSet.size()];
Random random=new Random();
//
long t1=System.currentTimeMillis();
if(!advanced){//standard random k-means seeding; random datapoints are choosen as prototypes
for(int i=0; i<k; i++){
int rand = random.nextInt(trainingsSet.size());
prototypes[i]=trainingsSet.getVectorAtIndex(rand);
}
}else{ //state-of-the-art k-means|| a.k.a k-means++ scalable seeding; explanation here: http://vldb.org/pvldb/vol5/p622_bahmanbahmani_vldb2012.pdf
prototypes[0]=trainingsSet.getVectorAtIndex(random.nextInt(trainingsSet.size())); //first protoype, chosen randomly
Vector<DataVector>kproto=new Vector<DataVector>(); //saves the prototypes
kproto.add(prototypes[0]);
for(int i=0;i<trainingsSet.size();i++){ //gets distance to all data points, sum it up
s+=trainingsSet.getVectorAtIndex(i).distance2(kproto.elementAt(0));
}
double it=Math.floor(Math.log(s)); // calculates how often the loop for step 4 and 5 is executed
for(int c=0; c<it; c++){
int[]psi=new int[trainingsSet.size()];//saves minimum distance to a protoype for every element
for(int i=0; i<trainingsSet.size();i++){
double min=Double.POSITIVE_INFINITY;
for(int j=0;j<kproto.size();j++){
double dist=trainingsSet.getVectorAtIndex(i).distance2(kproto.elementAt(j));
if(min>dist){
min=dist;
}
}
psi[i]=(int) min;
}
double phi_c=0;
for(int i=0; i<trainingsSet.size();i++)
phi_c+=psi[i]; //sums up squared distances
for(int f=0; f<trainingsSet.size();f++){
double p_x=5*psi[f]/phi_c; //oversampling factor 0.1*k (k is 50 in my case)
if(p_x>random.nextDouble()){
kproto.addElement(trainingsSet.getVectorAtIndex(f));//adds data point to the prototype set with a probability
//depending on its distance to the next prototype
}
}
}
int[]w=new int[kproto.size()]; //every prototype gets a value in w; the value is increased if the prototype has a minimum distance to a data point
for(int i=0; i<trainingsSet.size();i++){
double min=trainingsSet.getVectorAtIndex(i).distance2(kproto.elementAt(0));
if(min==0)
continue;
int index=0;
for(int j=1; j<kproto.size();j++){
double save=trainingsSet.getVectorAtIndex(i).distance2(kproto.elementAt(j));
if(min>save){
min=save;
index=j;
}
}
w[index]++;
}
int[]wtotal=new int[kproto.size()]; //wtotal sums the w values up
for(int i=0;i<kproto.size();i++){
for(int st=0; st<=i;st++){
wtotal[i]+=w[st];
}
}
int[]cselect=new int[k];//cselect saves the final prototypes
int stoppoint=0;
boolean repeat=false; //repeat lets the choosing process repeat if the prototype has already been selected
for(int kk=0;kk<k;kk++){
do{
repeat=false;
int stopper=random.nextInt(wtotal[kproto.size()-1]);//randomly choose a int and check in which interval it lies
for(int st=wtotal.length-1;st>=0;st--){
if(stopper>=wtotal[st]){
stoppoint=wtotal.length-st-1;
break;
}
}
for(int i=0; i<kk;i++){
if(cselect[i]==stoppoint)
repeat=true;
}
}while(repeat);
//are all prototypes overwritten?
prototypes[kk]=kproto.get(stoppoint);//the number of the interval is connected to a prototype; the prototype is added to the final set of prototypes "prototypes"
cselect[kk]=stoppoint;
}
}
long t2=System.currentTimeMillis();
System.out.println(advanced+" Init time: "+(t2-t1));
表现显示两个选项(标准,k-means ||)达到正确聚类的水平(约85%)。但是,初始化的运行时间不同。 对于标准k-均值,种子是准立即的,而k-均为||需要600-900毫秒(1000个数据点)。之后标准最大化/期望的收敛需要相同的时间(大约1900-2500ms)。这是刺激因为k-means ||应该收敛得更快。
我希望你发现一些错误,或者如果我期待别的东西而不是k-means ||来解释我可以提供。 谢谢你的帮助!