Choosing the K value for KNN with a simple dataset in R

Asked: 2019-01-30 17:20:50

Tags: r algorithm machine-learning knn

I know this question has been answered online in many cases, but since the answer is so dataset-dependent, I am wondering whether there is a straightforward way to find the best K value for the KNN algorithm using a relatively simple dataset.

My response variable is a behavioural class (column E: Event), and my predictors are the three axes of an activity sensor (columns B to D). sample shows what my data looks like.

Below is the code I wrote to run the KNN analysis. The datanet object looks just like the sample image I uploaded. I use the first 150 rows for training and the remaining rows [151 to 240] for testing.

In this case I used a K value of 10, but after running the script for different K values I obviously get different outputs, so I would like to know the best way to select the K value most appropriate for my dataset. In particular, I need help coding this in R.

library(data.table)

#From the file "Collar_#.txt", just select the columns ACTIVITY_X, ACTIVITY_Y, ACTIVITY_Z and Event
dataraw<-fread("Collar_41361.txt", select = c("ACTIVITY_X","ACTIVITY_Y","ACTIVITY_Z","Event"))

#Now, delete all rows containing the string "End"
datanet<-dataraw[!grepl("End", dataraw$Event),]

#Then, read only the columns ACTIVITY_X, ACTIVITY_Y and ACTIVITY_Z for a selected interval that will act as the training set
trainset <- datanet[1:150, !"Event"]
View(trainset)

#Create the behavioural classes. Note that the rows must cover the same interval as the trainset dataset
behaviour<-datanet[1:150, "Event"]
View(behaviour)

#Test file. This file contains sensor data only, and behaviours would be associated based on the trainset and behaviour datasets
testset<-datanet[151:240,!"Event"]
View(testset)

#Convert inputs into the forms class::knn expects: numeric matrices and a factor of class labels
train = as.matrix(trainset)
test = as.matrix(testset)
classes = factor(behaviour$Event)

library(stats)
library(class)

#Now run the algorithm. But first we set the k value.

kk = 10

kn1 = knn(train, test, classes, k = kk, prob = TRUE)

prob = attributes(kn1)  # attributes of the fit; $prob holds the winning-class proportions
clas1 = factor(kn1)

#Write results, this is the classification of the testing set in a single column
filename = paste("results", kk, ".csv", sep="")
write.csv(clas1, filename)

#Write probs to file, this is the proportion of k nearest datapoints that contributed to the winning class
fileprobs = paste("probs", kk, ".csv", sep="")
write.csv (prob$prob, fileprobs)
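To answer the question directly, one way to pick K is to repeat the classification above for a range of K values and keep the one with the best accuracy on the held-out rows. A minimal sketch, using iris as a stand-in since the Collar_41361.txt data is not available (the split sizes here are arbitrary, not the question's 150/90):

```r
library(class)

# Stand-in data: iris. Columns 1:4 play the role of the sensor axes,
# Species plays the role of Event.
set.seed(1)
idx   <- sample(nrow(iris), 100)
train <- as.matrix(iris[idx, 1:4])
test  <- as.matrix(iris[-idx, 1:4])
cl    <- iris$Species[idx]     # training labels
truth <- iris$Species[-idx]    # held-out labels, used only to score each K

# Repeat the classification for a range of K values and record accuracy
ks  <- 1:20
acc <- sapply(ks, function(kk) mean(knn(train, test, cl, k = kk) == truth))

best_k <- ks[which.max(acc)]
best_k
```

The same loop would drop straight into the script above by replacing the iris objects with train, test, classes and the true Event values for rows 151:240.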

I have also uploaded a sample image of the script output. Column D shows the actual behavioural class for the values in columns A to C, and columns E, G, I, K, M and O show the classes the algorithm assigned, based on training on rows [1:150], for different K values.

Any help would be appreciated!!!

1 Answer:

Answer 0 (score: 1)

Finding K in KNN is not trivial. A small value of K means that noise will have a greater influence on the result, while a large value makes the computation expensive.
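The trade-off shows up even on a toy example: with K equal to the number of training points, KNN always predicts the majority class, while with K = 1 the prediction follows whatever single point happens to be nearest. A sketch with made-up two-class data:

```r
library(class)

set.seed(42)
# Made-up 2-D data: 30 points of class "a" near the origin,
# 10 points of class "b" shifted to be centred near (3, 3).
x  <- matrix(rnorm(80), ncol = 2)
cl <- factor(rep(c("a", "b"), times = c(30, 10)))
x[cl == "b", ] <- x[cl == "b", ] + 3

query <- matrix(c(3, 3), ncol = 2)  # a point sitting in class "b" territory

knn(x, query, cl, k = 1)   # small K: decided by the single nearest point
knn(x, query, cl, k = 40)  # K = N: 30 "a" votes beat 10 "b" votes, always "a"
```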

I usually see people use K = SQRT(N). But if you want to find a better K for your case, use KNN from the caret package. Here is an example:

library(ISLR)
library(caret)

# Split the data:
data(iris)
indxTrain <- createDataPartition(y = iris$Species, p = 0.75, list = FALSE)  # stratify on the response
training <- iris[indxTrain,]
testing <- iris[-indxTrain,]

# Run k-NN:
set.seed(400)
ctrl <- trainControl(method="repeatedcv",repeats = 3)
knnFit <- train(Species ~ ., data = training, method = "knn", trControl = ctrl, preProcess = c("center","scale"),tuneLength = 20)
knnFit

#Use the plot to see the optimal number of neighbours:
#Plotting yields number of neighbours vs. accuracy (based on repeated cross-validation)
plot(knnFit)

[Plot: accuracy vs. number of neighbours, from plot(knnFit)]

This shows that K = 5 has the highest accuracy, so the value chosen for K is 5.
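Rather than reading K off the plot, the selected value can also be pulled from the fitted object directly. A self-contained sketch repeating the caret fit above (here stratifying the split on Species) and then scoring the held-out rows with the chosen K:

```r
library(caret)

data(iris)
set.seed(400)
# Stratify the train/test split on the response class
indxTrain <- createDataPartition(y = iris$Species, p = 0.75, list = FALSE)
training  <- iris[indxTrain, ]
testing   <- iris[-indxTrain, ]

ctrl   <- trainControl(method = "repeatedcv", repeats = 3)
knnFit <- train(Species ~ ., data = training, method = "knn",
                trControl = ctrl, preProcess = c("center", "scale"),
                tuneLength = 20)

knnFit$bestTune$k                 # the K selected by repeated cross-validation
pred <- predict(knnFit, newdata = testing)
mean(pred == testing$Species)     # accuracy of that K on the held-out rows
```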