Choosing the K value for KNN with a simple dataset in R

Asked: 2019-01-30 17:20:50

Tags: r algorithm machine-learning knn

I know this question has been answered online in many cases, but since the answer is so dataset-dependent, I am wondering whether there is a straightforward way to find the best K value for the KNN algorithm using a relatively simple dataset.

My response variable is a behavioural class (column E: Event), and my predictors are the three axes of an activity sensor (columns B to D). sample shows what my data looks like.

Below is the code I wrote to run the KNN analysis. The datanet object looks just like the sample image I uploaded. I use the first 150 rows for training and the remaining rows [151 to 240] for testing.

In this case I used a K value of 10, but after running the script for different K values I obviously get different outputs, so I would like to know the best way to select the K value most appropriate for my dataset. In particular, I need help coding this in R.

library(data.table)

#From the file "Collar_#.txt", just select the columns ACTIVITY_X, ACTIVITY_Y, ACTIVITY_Z and Event
dataraw<-fread("Collar_41361.txt", select = c("ACTIVITY_X","ACTIVITY_Y","ACTIVITY_Z","Event"))

#Now, delete all rows containing the string "End"
datanet<-dataraw[!grepl("End", dataraw$Event),]

#Then, read only the columns ACTIVITY_X, ACTIVITY_Y and ACTIVITY_Z for a selected interval that will act as the training set
trainset <- datanet[1:150, !"Event"]
View(trainset)

#Create the behavioural classes. Note that the rows must cover the same interval as the trainset dataset
behaviour<-datanet[1:150, "Event"]
View(behaviour)

#Test file. This file contains sensor data only, and behaviours would be associated based on the trainset and behaviour datasets
testset<-datanet[151:240,!"Event"]
View(testset)

#Convert inputs into the forms class::knn expects: numeric matrices and a factor of class labels
train = as.matrix(trainset)
test = as.matrix(testset)
classes = factor(behaviour$Event)

library(stats)
library(class)

#Now run the algorithm. But first we set the k value.

kk = 10

kn1 = knn(train, test, classes, k = kk, prob = TRUE)

prob = attributes(kn1)  # attributes of the fit; $prob holds the winning-class proportions
clas1 = factor(kn1)

#Write results, this is the classification of the testing set in a single column
filename = paste("results", kk, ".csv", sep="")
write.csv(clas1, filename)

#Write probs to file, this is the proportion of k nearest datapoints that contributed to the winning class
fileprobs = paste("probs", kk, ".csv", sep="")
write.csv (prob$prob, fileprobs)
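To answer the question directly, one way to pick K is to repeat the classification above for a range of K values and keep the one with the best accuracy on the held-out rows. A minimal sketch, using iris as a stand-in since the Collar_41361.txt data is not available (the split sizes here are arbitrary, not the question's 150/90):

```r
library(class)

# Stand-in data: iris. Columns 1:4 play the role of the sensor axes,
# Species plays the role of Event.
set.seed(1)
idx   <- sample(nrow(iris), 100)
train <- as.matrix(iris[idx, 1:4])
test  <- as.matrix(iris[-idx, 1:4])
cl    <- iris$Species[idx]     # training labels
truth <- iris$Species[-idx]    # held-out labels, used only to score each K

# Repeat the classification for a range of K values and record accuracy
ks  <- 1:20
acc <- sapply(ks, function(kk) mean(knn(train, test, cl, k = kk) == truth))

best_k <- ks[which.max(acc)]
best_k
```

The same loop would drop straight into the script above by replacing the iris objects with train, test, classes and the true Event values for rows 151:240.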

I have also uploaded a sample image of the script output. Column D shows the actual behavioural class for the values in columns A to C, and columns E, G, I, K, M and O show the classes the algorithm assigned, based on training on rows [1:150], for different K values.

Any help would be appreciated!!!

1 Answer:

Answer 0 (score: 1)

Finding K in KNN is not trivial. A small value of K means that noise will have a greater influence on the result, while a large value makes the computation expensive.
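The trade-off shows up even on a toy example: with K equal to the number of training points, KNN always predicts the majority class, while with K = 1 the prediction follows whatever single point happens to be nearest. A sketch with made-up two-class data:

```r
library(class)

set.seed(42)
# Made-up 2-D data: 30 points of class "a" near the origin,
# 10 points of class "b" shifted to be centred near (3, 3).
x  <- matrix(rnorm(80), ncol = 2)
cl <- factor(rep(c("a", "b"), times = c(30, 10)))
x[cl == "b", ] <- x[cl == "b", ] + 3

query <- matrix(c(3, 3), ncol = 2)  # a point sitting in class "b" territory

knn(x, query, cl, k = 1)   # small K: decided by the single nearest point
knn(x, query, cl, k = 40)  # K = N: 30 "a" votes beat 10 "b" votes, always "a"
```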

I usually see people use K = SQRT(N). But if you want to find a better K for your case, use KNN from the caret package. Here is an example:

library(ISLR)
library(caret)

# Split the data:
data(iris)
indxTrain <- createDataPartition(y = iris$Species, p = 0.75, list = FALSE)  # stratify on the response
training <- iris[indxTrain,]
testing <- iris[-indxTrain,]

# Run k-NN:
set.seed(400)
ctrl <- trainControl(method="repeatedcv",repeats = 3)
knnFit <- train(Species ~ ., data = training, method = "knn", trControl = ctrl, preProcess = c("center","scale"),tuneLength = 20)
knnFit

#Use the plot to see the optimal number of neighbours:
#Plotting yields number of neighbours vs. accuracy (based on repeated cross-validation)
plot(knnFit)

[Plot: accuracy vs. number of neighbours, from plot(knnFit)]

This shows that K = 5 has the highest accuracy, so the value chosen for K is 5.
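Rather than reading K off the plot, the selected value can also be pulled from the fitted object directly. A self-contained sketch repeating the caret fit above (here stratifying the split on Species) and then scoring the held-out rows with the chosen K:

```r
library(caret)

data(iris)
set.seed(400)
# Stratify the train/test split on the response class
indxTrain <- createDataPartition(y = iris$Species, p = 0.75, list = FALSE)
training  <- iris[indxTrain, ]
testing   <- iris[-indxTrain, ]

ctrl   <- trainControl(method = "repeatedcv", repeats = 3)
knnFit <- train(Species ~ ., data = training, method = "knn",
                trControl = ctrl, preProcess = c("center", "scale"),
                tuneLength = 20)

knnFit$bestTune$k                 # the K selected by repeated cross-validation
pred <- predict(knnFit, newdata = testing)
mean(pred == testing$Species)     # accuracy of that K on the held-out rows
```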