I have tried to implement the kernel k-means algorithm by hand in R, but my loop takes more than 30 minutes to run. Here is the code:
# Calculating kernel k-means
library(dplyr)  # provides %>% and mutate()
rbfkmeans<-function(data,c,q=0.02,L=0.7){
  # assign a random cluster label to each observation
  iter=0
  data<-data%>%
    mutate(cluster=sample(1:c,nrow(data),replace=TRUE))
  mini=rep(1,nrow(data))
  ## EUCLIDEAN DISTANCE
  # Remember:
  # 1. || a || = sqrt(aDOTa)
  # 2. d(x,y) = || x - y || = sqrt((x-y)DOT(x-y))
  # 3. aDOTb = sum(a*b)
  d<-function(x,y){
    aux=x-y
    dis=sqrt(sum(aux*aux))
    return(dis)
  }
  ## Radial Basis Function Kernel
  # Remember:
  # 1. K(x,x')=exp(-q||x-x'||^2), where ||x-x'|| can be defined as the
  #    Euclidean distance and 'q' is the gamma parameter
  rbf<-function(x,y,q=0.2){
    aux<-d(x,y)
    rbfd<-exp(-q*(aux)^2)
    return(rbfd)
  }
  #
  # calculating the kernel matrix
  kernelmatrix=matrix(0,nrow(data),nrow(data))
  for(i in 1:nrow(data)){
    for(j in 1:nrow(data)){
      kernelmatrix[i,j]=rbf(data[i,1:(ncol(data)-1)],data[j,1:(ncol(data)-1)],q)
    }
  }
  r=rep(0,nrow(data))
  distance=matrix(0,nrow(data),c)
  while( (sum(r==data[,'cluster'])!=nrow(data)) && iter <30 ){
    ans=0
    # Calculating the distances in the kernelized version (RBF example)
    print('running')
    third=rep(0,c) # here 'third' holds the per-cluster center term,
                   # which does not depend on the individual observation.
    for(g in 1:c){
      ans=0
      for(k in 1:nrow(data)){
        for(l in 1:nrow(data)){
          ans = ans + (data[k,'cluster']==g)*(data[l,'cluster']==g)*kernelmatrix[k,l]
        }
      }
      third[g]=ans
    }
    for (ii in 1:nrow(data)){
      for(j in 1:c) {
        distance[ii,j]= kernelmatrix[ii,ii]-
          2*sum((data[,'cluster']==j)*kernelmatrix[ii,])/sum(data[,'cluster']==j)+
          third[j]/(sum(data[,'cluster']==j)^2)
      }
    }
    r=data[,'cluster']
    # Checking the shortest distance
    for(k in 1:nrow(data)){
      data[k,'cluster']=match(min(distance[k,]),distance[k,])
      mini[k]=min(distance[k,])
    }
    plot(data[1:(ncol(data)-1)], col=data$cluster)
    iter=iter+1
    print(paste('Iteration number:',iter))
    print(paste('Mean of min. distances:',mean(mini)))
    #print(g==data$'cluster')
  }
  return(data)
}
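Does anyone know how to optimize this? The main problem is the calculation of the #third term. I guess too much time is spent inside the loop checking (data[k,'cluster']==g), but I have no more ideas for improving it...
Note: data[k,'cluster']==g is used to check whether the observation belongs to the cluster.
Edit: the part of the code that takes a long time to run: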
for(g in 1:c){
  ans=0
  for(k in 1:nrow(data)){
    for(l in 1:nrow(data)){
      ans = ans + (data[k,'cluster']==g)*(data[l,'cluster']==g)*kernelmatrix[k,l]
    }
  }
  third[g]=ans
}
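For reference, the triple loop above adds up kernelmatrix[k,l] over every pair (k,l) in which both observations belong to cluster g. That is the sum of one sub-block of the kernel matrix, so it can probably be written without the inner loops. A minimal sketch on toy data, assuming the labels are held in a plain integer vector; the names x, cluster, n and n_clusters are illustrative and not part of the original code:

```r
# Toy stand-ins for the question's objects (hypothetical data).
set.seed(1)
n <- 50
n_clusters <- 3
x <- matrix(rnorm(n * 2), n, 2)
kernelmatrix <- exp(-0.02 * as.matrix(dist(x))^2)
cluster <- sample(1:n_clusters, n, replace = TRUE)

# The original triple loop, kept only for comparison.
third_loop <- rep(0, n_clusters)
for (g in 1:n_clusters) {
  ans <- 0
  for (k in 1:n) {
    for (l in 1:n) {
      ans <- ans + (cluster[k] == g) * (cluster[l] == g) * kernelmatrix[k, l]
    }
  }
  third_loop[g] <- ans
}

# Vectorized version: sum the block of the kernel matrix
# whose rows and columns both belong to cluster g.
third_vec <- sapply(1:n_clusters, function(g) {
  idx <- cluster == g
  sum(kernelmatrix[idx, idx])
})
```

Extracting the label column once into a vector (for example cluster <- data$cluster) also avoids indexing the data frame inside the loop, which is likely a large part of the cost.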
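Answer 0 (score: 1)
It looks like you could optimize your distance and radial functions. Your distance function takes the square root of a sum, and your radial function then squares that result and negates it, so the square root is wasted work.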
Map
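One way to read the point above: since rbf() squares the value that d() has just taken the square root of, the kernel can be built from squared distances directly, and for all pairs of rows at once. A minimal sketch, assuming the feature columns have been pulled into a numeric matrix (x, q and the toy data here are illustrative, not taken from the original code):

```r
# Hypothetical toy features; in the question this would be something like
# as.matrix(data[, 1:(ncol(data) - 1)]).
set.seed(1)
x <- matrix(rnorm(40), nrow = 20, ncol = 2)
q <- 0.02

# Squared Euclidean distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 * a.b
sq_norms <- rowSums(x^2)
sq_dist <- outer(sq_norms, sq_norms, "+") - 2 * tcrossprod(x)
sq_dist[sq_dist < 0] <- 0  # guard against tiny negative rounding errors

# RBF kernel matrix, with no sqrt followed by squaring
kernelmatrix <- exp(-q * sq_dist)

# Element-by-element reference, only to check the result
kernel_ref <- matrix(0, nrow(x), nrow(x))
for (i in seq_len(nrow(x))) {
  for (j in seq_len(nrow(x))) {
    kernel_ref[i, j] <- exp(-q * sum((x[i, ] - x[j, ])^2))
  }
}
```

The base function dist() would work as well, as in exp(-q * as.matrix(dist(x))^2), at the cost of one sqrt per pair.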
In addition, you should be able to convert the code to use foreach loops and take advantage of one of the parallelization libraries (e.g. doParallel).
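As a rough illustration of the foreach suggestion, the per-cluster term could be computed with one task per cluster. This is only a sketch, assuming the foreach and doParallel packages are installed; the helper third_term_parallel, its workers argument and the toy data are hypothetical names introduced here:

```r
library(foreach)
library(doParallel)

# Hypothetical helper: one parallel task per cluster.
third_term_parallel <- function(kernelmatrix, cluster, n_clusters, workers = 2) {
  cl <- parallel::makeCluster(workers)
  on.exit(parallel::stopCluster(cl))  # always release the workers
  registerDoParallel(cl)
  foreach(g = seq_len(n_clusters), .combine = c) %dopar% {
    idx <- cluster == g
    sum(kernelmatrix[idx, idx])
  }
}

# Toy usage with made-up data
set.seed(1)
x <- matrix(rnorm(60), nrow = 30, ncol = 2)
kernelmatrix <- exp(-0.02 * as.matrix(dist(x))^2)
cluster <- sample(1:3, nrow(x), replace = TRUE)
third <- third_term_parallel(kernelmatrix, cluster, n_clusters = 3)
```

With only a handful of clusters, the cost of starting the workers may outweigh the gain, so it is worth timing this against a plain vectorized version before adopting it.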