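I have a dataset of 45212 elements with 17 columns, and I want to use the kNN algorithm to predict the class label in the last column. As far as I can tell everything is fine, but I always get this error: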
"Error in knn(train = data_train, test = data_test, cl = data_train_labels, :
no missing values are allowed"
Here is my code:
> data_train <-data[1:25000,]
> data_test <-data[25001:45212,]
> data_train_labels <- data[1:25000, 17]
> data_test_labels <- data[25001:45212, 17]
> install.packages("class")
> library(class)
> data_test_pred <- knn(train=data_train, test=data_test, cl=data_train_labels, k=10)
This is what my dataset looks like:
age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no
33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no
35,management,married,tertiary,no,231,yes,no,unknown,5,may,139,1,-1,0,unknown,no
28,management,single,tertiary,no,447,yes,yes,unknown,5,may,217,1,-1,0,unknown,no
42,entrepreneur,divorced,tertiary,yes,2,yes,no,unknown,5,may,380,1,-1,0,unknown,no
58,retired,married,primary,no,121,yes,no,unknown,5,may,50,1,-1,0,unknown,no
43,technician,single,secondary,no,593,yes,no,unknown,5,may,55,1,-1,0,unknown,no
41,admin.,divorced,secondary,no,270,yes,no,unknown,5,may,222,1,-1,0,unknown,no
Answer 0 (score: 1)
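I think your problem is all of the factors in your data. The knn documentation says that it uses Euclidean distance, which does not make sense for factors. If you really want to use knn, here is a possible solution. You can use daisy from the cluster package to get a distance matrix between your points. There are several implementations of knn in R, but I do not know of one that accepts a distance matrix. You could either write your own (not that difficult) or use cmdscale to map the distance matrix into a Euclidean space, and then run knn on the projected space.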
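The daisy, cmdscale, knn route described above can be sketched roughly as follows. This is only an illustration under assumptions: the data frame `toy`, its columns, the Gower metric, the choice of 3 dimensions, and the 150/50 split are all made up for the example and do not come from the question. The distance matrix grows quadratically with the number of rows, so the full dataset may not fit in memory; the sketch uses 200 synthetic rows.

```r
library(cluster)  # daisy(): dissimilarities for mixed numeric/factor data
library(class)    # knn()

set.seed(1)
n <- 200

# Hypothetical toy data that mimics a few columns of the question's dataset.
toy <- data.frame(
  age     = sample(18:80, n, replace = TRUE),
  job     = factor(sample(c("management", "technician", "retired"), n, replace = TRUE)),
  housing = factor(sample(c("yes", "no"), n, replace = TRUE)),
  balance = round(rnorm(n, mean = 1000, sd = 500))
)
# Made-up label, derived from age only so the example has some structure.
toy$y <- factor(ifelse(toy$age > 50, "yes", "no"))

# Keep the label out of the features before computing distances.
features <- toy[, setdiff(names(toy), "y")]

# Gower dissimilarity handles numeric and factor columns together.
d <- daisy(features, metric = "gower")

# Classical multidimensional scaling: embed the points in 3 Euclidean dimensions.
coords <- cmdscale(d, k = 3)

train_idx <- 1:150
test_idx  <- 151:n

# Ordinary Euclidean kNN on the projected coordinates.
pred <- knn(train = coords[train_idx, ],
            test  = coords[test_idx, ],
            cl    = toy$y[train_idx],
            k     = 10)

table(predicted = pred, actual = toy$y[test_idx])
```

Note that the embedding here is computed on training and test rows together, which is convenient for a sketch but means new points cannot be projected later without recomputing it.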
Answer 1 (score: 1)
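I believe your error is in this line: data_train <- data[1:25000,]
You are including the header row, which has not been normalized. I was able to reproduce the same error, but when I changed it to data_train <- data[2:25000,] it ran fine.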
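A quick way to check whether the header or the row ranges are the culprit is to look at how the file was read and how many data rows it really has. The snippet below is a minimal sketch on a few sample rows from the question; the names `csv_text`, `with_header`, `no_header` and `past_end` are made up for illustration. It also shows that indexing a data frame past its last row produces a row of NA values, which would happen if, for instance, the 45212 figure counted the header line (an assumption, not something the question confirms).

```r
# A few rows in the same layout as the question's data (columns trimmed).
csv_text <- "age,job,marital,balance,y
58,management,married,2143,no
44,technician,single,29,no
33,entrepreneur,married,2,no"

# header = TRUE: the first line becomes column names, not a data row.
with_header <- read.csv(text = csv_text, header = TRUE)
nrow(with_header)   # 3 data rows
str(with_header)    # age and balance are read as numbers

# header = FALSE: the header line ends up as the first data row,
# so even the "numeric" columns are no longer numeric.
no_header <- read.csv(text = csv_text, header = FALSE)
nrow(no_header)             # 4 rows, including the header line
is.numeric(no_header$V1)    # FALSE

# Indexing one row past the end silently yields a row of NAs.
past_end <- with_header[1:(nrow(with_header) + 1), ]
anyNA(past_end)             # TRUE
```

Running nrow(data) and anyNA(data_test) on the real data before calling knn would show which of these cases applies.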