I am trying to use this R code:
fitBData <- function(filePath="PATH") {
  dataset <- read.table(filePath, header=TRUE, sep=",", na.strings="NaN")
  clusterdata <- dataset[,3:7]
  # rownames(clusterdata) <- dataset$WallPrimAddr
  clusterdata_norm <- clusterdata
  # clusterdata <- clusterdata[rowSums(is.na(clusterdata_norm))==0,]
  # print( clusterdata_norm[rowSums(is.na(clusterdata_norm))>0,])
  wss <- (nrow(clusterdata_norm)-1)*sum(apply(clusterdata_norm,2,var))
  # print(sum(is.na(clusterdata_norm)))
  for (i in 2:30) {
    km <- kmeans(clusterdata_norm,centers=i)
    # print(km$totss)
    wss[i] <- sum(km$withinss)
  }
  plot(1:30, wss, type="b", xlab="Number of Clusters",ylab="Within groups sum of squares")
  fit <- kmeans(clusterdata_norm, 11)
  results <- data.frame(dataset, fit$cluster)
  colnames(results)[20] <- "ClusterNumber"
  write.table(results, "PATH", sep=",")
  write.table(fit$centers, "PATH", sep=",")
}
to run k-means clustering on the data in a CSV file with the following format:
WalletNodeID,WallPrimAddr,pageRank,averageNumInputPerDay,averageNumOutputPerDay,averageInputVolumePerDay,averageOutputVolumePerDay
3,3GpGgT3sFxcBtt6JZRG3Kn7N4JfmsszXy6,47.25590464,5.205061695,0.035746272,0.581360333,0.120171064
4,3376X6m1YX3Jm4J4c2zq8w93VdSnUnZj31,0.088634463,-0.192355963,-0.040254904,0.020482286,-0.085882672
The first two columns only serve to identify each entity.
After running the code, this error is thrown:
Error in names(x) <- value :
'names' attribute [20] must be the same length as the vector [8]
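I find this strange, because the data file is an extract of another data file that has 21 columns, and the code processes that other file without any problems (with the line clusterdata <- dataset[,3:21] in place of the line clusterdata <- dataset[,3:7] in the code shown).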
Why is this happening?
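
For reference, here is a minimal sketch that seems to reproduce the same message on made-up data. The `toy` data frame and its random values are assumptions for illustration only, not my real file. It suggests the hard-coded index 20 in `colnames(results)[20]` may be what fails once `results` has only 8 columns (7 from the file plus the cluster column).

```r
# Toy stand-in for the 7-column extract; the values are random, not real data
set.seed(1)
toy <- data.frame(matrix(rnorm(70), ncol = 7))

fit <- kmeans(toy[, 3:7], centers = 2)
results <- data.frame(toy, fit$cluster)
n_cols <- ncol(results)  # 7 original columns plus the cluster column

# Assigning a name at position 20 pads the names vector to length 20,
# which no longer matches the number of columns, so R raises an error
err <- tryCatch({
  colnames(results)[20] <- "ClusterNumber"
  NULL
}, error = function(e) conditionMessage(e))
print(err)

# Addressing the last column by position does not depend on the column count
colnames(results)[ncol(results)] <- "ClusterNumber"
print(colnames(results))
```

If that is indeed the cause, naming the column when it is created, for example `data.frame(dataset, ClusterNumber = fit$cluster)`, would avoid the positional index altogether.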