I know this question has been answered online in many contexts, but since the answer is so dataset-dependent, I would like to know whether there is a simple way to find the optimal K value for the KNN algorithm on a relatively simple dataset.
My response variable is the behaviour class (column E: Event), and the predictors are the three axes of an activity sensor (columns B to D). The sample image shows what my data look like.
Below is the code I wrote to run the KNN analysis. The datanet
object looks just like the sample image I uploaded. I use the first 150 rows for training and the remaining rows [151 to 240] for testing.
In this case I used a K value of 10, but after running the script with different K values I obviously get different outputs, so I would like to know the best way to choose the K value that fits my dataset. In particular, I need help coding this in R.
library(data.table)
#From the file "Collar_#.txt", just select the columns ACTIVITY_X, ACTIVITY_Y, ACTIVITY_Z and Event
dataraw<-fread("Collar_41361.txt", select = c("ACTIVITY_X","ACTIVITY_Y","ACTIVITY_Z","Event"))
#Now, delete all rows containing the string "End"
datanet<-dataraw[!grepl("End", dataraw$Event),]
#Then, read only the columns ACTIVITY_X, ACTIVITY_Y and ACTIVITY_Z for a selected interval that will act as a training set
trainset <- datanet[1:150, !"Event"]
View(trainset)
#Create the behavioural classes. Note that the number of rows should be in the same interval as the trainset dataset
behaviour<-datanet[1:150,!1:3]
View(behaviour)
#Test file. This file contains sensor data only, and behaviours would be associated based on the trainset and behaviour datasets
testset<-datanet[151:240,!"Event"]
View(testset)
#Converting inputs into matrix
#Note: as.matrix() ignores byrow/ncol arguments, so they are dropped here;
#knn() expects the class labels as a factor
train <- as.matrix(trainset)
test <- as.matrix(testset)
classes <- factor(behaviour[["Event"]])
#stats is loaded by default, so only class is needed for knn()
library(class)
#Now run the algorithm. But first we set the k value.
kk <- 10
kn1 <- knn(train, test, classes, k = kk, prob = TRUE)
#attributes(kn1)$prob holds the winning-class proportions (prob = TRUE above)
prob <- attributes(kn1)
clas1 <- factor(kn1)
#Write results, this is the classification of the testing set in a single column
filename = paste("results", kk, ".csv", sep="")
write.csv(clas1, filename)
#Write probs to file, this is the proportion of k nearest datapoints that contributed to the winning class
fileprobs = paste("probs", kk, ".csv", sep="")
write.csv(prob$prob, fileprobs)
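Since the true Event labels for rows 151:240 are still available in datanet (they were only excluded from testset), one way to compare K values on this same split is to loop over candidate k's and record the hold-out accuracy. This is a minimal sketch, assuming the train, test, classes and datanet objects defined above; the range 1:20 for k is an arbitrary choice for illustration:

```r
library(class)

# True behaviour classes for the test rows, taken from the original data
true_labels <- datanet$Event[151:240]

# Try a range of k values on the same train/test split
ks <- 1:20
accuracy <- sapply(ks, function(kk) {
  pred <- knn(train, test, classes, k = kk)
  mean(pred == true_labels)   # fraction of test rows classified correctly
})

# k with the highest hold-out accuracy
best_k <- ks[which.max(accuracy)]
plot(ks, accuracy, type = "b", xlab = "k", ylab = "hold-out accuracy")
```

Note that this evaluates k on a single fixed split, so the chosen best_k can be noisy; cross-validation (as in the answer below using caret) gives a more stable estimate.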
I have also uploaded a sample image of the script's output. Column D shows the "actual behaviour class" for the values in columns A to C, and columns E, G, I, K, M and O show the classes the algorithm assigned, based on the training rows [1:150], for different K values.
Any help is greatly appreciated!
Answer 0 (score: 1)
In KNN, finding K is not trivial. A small value of K
means that noise will have a greater influence on the result, while a large value makes the computation expensive.
A common rule of thumb is K = sqrt(N). However, if you want to find a better K
for your case, use KNN from the caret package. Here is an example:
library(ISLR)
library(caret)
# Split the data:
data(iris)
indxTrain <- createDataPartition(y = iris$Sepal.Length,p = 0.75,list = FALSE)
training <- iris[indxTrain,]
testing <- iris[-indxTrain,]
# Run k-NN:
set.seed(400)
ctrl <- trainControl(method="repeatedcv",repeats = 3)
knnFit <- train(Species ~ ., data = training, method = "knn", trControl = ctrl, preProcess = c("center","scale"),tuneLength = 20)
knnFit
#Use plots to see the optimal number of neighbours:
#Plotting yields number of neighbours vs. accuracy (based on repeated cross-validation)
plot(knnFit)
This shows that 5 has the highest accuracy, so the value of K
is 5.
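Rather than reading the best k off the plot, caret also stores it in the fitted object, and the tuned model can be applied to the held-out data directly. A short sketch, continuing from the knnFit and testing objects above:

```r
# The winning k selected by repeated cross-validation
knnFit$bestTune

# Predict on the held-out split and summarise performance
pred <- predict(knnFit, newdata = testing)
confusionMatrix(pred, testing$Species)
```

For your own data you would replace Species with Event and pass the ACTIVITY_X/Y/Z columns as predictors in the same way.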