I have a factor field, Gender, in a data frame named dataset. As far as I know, a factor is similar to an enum in C: each name is mapped to a number.
> dataset$Gender <- as.factor(dataset$Gender)
> str(dataset$Gender)
Factor w/ 2 levels "Female","Male": 2 2 1 1 2 2 1 1 2 1 ...
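To make the enum analogy concrete, here is a minimal sketch on a toy vector (the values are made up for illustration and are not taken from the real dataset). It shows the two views a factor offers: the integer codes and the level labels.

```r
# Toy vector for illustration only, not the real dataset
gender <- factor(c("Male", "Male", "Female", "Female", "Male"))

# Levels are sorted alphabetically by default
lv <- levels(gender)

# Integer codes behind the factor (position of each value in levels)
codes <- as.integer(gender)

# Text labels, which is what a conversion to character returns
labels_txt <- as.character(gender)

print(lv)
print(codes)
print(labels_txt)
```

Which of the two views a function ends up using depends on how it coerces its input, so the codes are not guaranteed to be what a distance computation sees.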
Now, when I run K-Nearest Neighbor and pass this field as an independent variable, it throws an error.
However, if I supply labels for this factor field, everything works:
> dataset$Gender <- factor(dataset$Gender , levels = c("Female","Male") , labels = c(0,1))
> str(dataset$Gender)
Factor w/ 2 levels "0","1": 2 2 1 1 2 2 1 1 2 1 ...
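What did the labels change? Do they assign numeric weights to Male and Female that make it possible to compute the Euclidean distance? If so, why isn't the Euclidean distance computed from the mapping the factor already provides (Female: 1, Male: 2) when no labels are supplied? Why doesn't that Female: 1, Male: 2 mapping work in the distance calculation?
Dataset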
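One way to probe this is to look at what happens when a data frame holding a factor is turned into a matrix. The sketch below uses only base R and a made-up toy data frame. It assumes that knn() coerces its train and test arguments with as.matrix() and then as.double(), which is an inference from the "NAs introduced by coercion" warning and not something confirmed in this post.

```r
# Toy data frame for illustration only, not the real dataset
toy <- data.frame(
  Gender = factor(c("Male", "Female", "Male")),
  Age    = c(19, 26, 35)
)

# A data frame with a factor column becomes a character matrix,
# holding the level labels and not the integer codes
m_text <- as.matrix(toy)

# Text such as "Male" cannot be parsed as a number, so it becomes NA
num_text <- suppressWarnings(as.double(m_text[, "Gender"]))

# With labels 0 and 1 the matrix holds "0" and "1", which do parse
toy_lab <- toy
toy_lab$Gender <- factor(toy$Gender, levels = c("Female", "Male"), labels = c(0, 1))
num_digits <- as.double(as.matrix(toy_lab)[, "Gender"])

# data.matrix() is a base R alternative that keeps the integer codes
codes <- unname(data.matrix(toy)[, "Gender"])

print(num_text)
print(num_digits)
print(codes)
```

If that assumption holds, the labels do not add weights. They only make the text form of the factor parseable as numbers, while the internal 1/2 codes are never consulted.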
> head(dataset)
User.ID Gender Age EstimatedSalary Purchased
1 15624510 Male 19 19000 0
2 15810944 Male 35 20000 0
3 15668575 Female 26 43000 0
4 15603246 Female 27 57000 0
5 15804002 Male 19 76000 0
6 15728773 Male 27 58000 0
Example that raises the error
dataset <- read.csv("~/Desktop/Machine Learning /ML_16/Social_Network_Ads.csv")
library(caTools)
set.seed(1231)
sample_split <- sample.split(dataset$Gender , SplitRatio = 0.8)
training_dataset <- subset(dataset , sample_split == TRUE)
testing_dataset <- subset(dataset , sample_split == FALSE)
library(class)
model_classifier <- knn(train = training_dataset[,-5] , test = testing_dataset[,-5] , cl = training_dataset$Purchased , k = 21 )
library(caret)
confusionMatrix(table(model_classifier , testing_dataset$Purchased))
Error
Error in knn(train = training_dataset[, -5], test = testing_dataset[, :
NA/NaN/Inf in foreign function call (arg 6)
In addition: Warning messages:
1: In knn(train = training_dataset[, -5], test = testing_dataset[, :
NAs introduced by coercion
2: In knn(train = training_dataset[, -5], test = testing_dataset[, :
NAs introduced by coercion
After assigning labels
dataset <- read.csv("~/Desktop/Machine Learning /ML_16/Social_Network_Ads.csv")
dataset$Gender <- factor(dataset$Gender , levels = c("Female","Male") , labels = c(0 , 1))
library(caTools)
set.seed(1231)
sample_split <- sample.split(dataset$Gender , SplitRatio = 0.8)
training_dataset <- subset(dataset , sample_split == TRUE)
testing_dataset <- subset(dataset , sample_split == FALSE)
library(class)
model_classifier <- knn(train = training_dataset[,-5] , test = testing_dataset[,-5] , cl = training_dataset$Purchased , k = 21 )
library(caret)
confusionMatrix(table(model_classifier , testing_dataset$Purchased))