Question

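I have a factor field, Gender, in a data frame named dataset. As far as I know, a factor is similar to an enum in C, i.e. each name is mapped to a number.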

> dataset$Gender <- as.factor(dataset$Gender)
> str(dataset$Gender)
 Factor w/ 2 levels "Female","Male": 2 2 1 1 2 2 1 1 2 1 ...

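Now, when I run K-Nearest Neighbor and pass this field as an independent variable, it throws an error.

However, if I supply labels for this factor field, everything works fine: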

> dataset$Gender <- factor(dataset$Gender , levels = c("Female","Male") , labels = c(0,1))
> str(dataset$Gender)
 Factor w/ 2 levels "0","1": 2 2 1 1 2 2 1 1 2 1 ...

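What did the labels change? Do they give Male and Female a numeric weight that helps when computing the Euclidean distance? If so, why isn't the Euclidean distance computed from the mapping the factor itself already makes (Female: 1, Male: 2) when no labels are supplied? Why doesn't that built-in mapping work in the Euclidean distance calculation?

Dataset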

> head(dataset)
   User.ID Gender Age EstimatedSalary Purchased
1 15624510   Male  19           19000         0
2 15810944   Male  35           20000         0
3 15668575 Female  26           43000         0
4 15603246 Female  27           57000         0
5 15804002   Male  19           76000         0
6 15728773   Male  27           58000         0

Example that produces the error

dataset <- read.csv("~/Desktop/Machine Learning /ML_16/Social_Network_Ads.csv")

library(caTools)

set.seed(1231)



sample_split <- sample.split(dataset$Gender , SplitRatio = 0.8)

training_dataset <- subset(dataset , sample_split == TRUE)

testing_dataset <- subset(dataset , sample_split == FALSE)


library(class)

model_classifier <- knn(train = training_dataset[,-5] , test = testing_dataset[,-5] , cl = training_dataset$Purchased , k = 21 )

library(caret)

confusionMatrix(table(model_classifier , testing_dataset$Purchased))

Error

Error in knn(train = training_dataset[, -5], test = testing_dataset[,  : 
  NA/NaN/Inf in foreign function call (arg 6)
In addition: Warning messages:
1: In knn(train = training_dataset[, -5], test = testing_dataset[,  :
  NAs introduced by coercion
2: In knn(train = training_dataset[, -5], test = testing_dataset[,  :
  NAs introduced by coercion
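
The "NAs introduced by coercion" warning is consistent with the factor's text labels, not its integer codes, being what gets converted to numbers. A minimal sketch of that behaviour, assuming (as its source appears to do) that knn() passes train and test through as.matrix() and then as.double(); the three-row data frame here is made up for illustration:

```r
# Toy data frame (hypothetical values, not the real CSV)
df <- data.frame(
  Gender = factor(c("Male", "Female", "Male")),
  Age = c(19, 26, 35)
)

# The factor does store integer codes (Female = 1, Male = 2) ...
codes <- as.integer(df$Gender)

# ... but as.matrix() on a data frame with a non-numeric column
# builds a character matrix from the labels, not from the codes
m <- as.matrix(df)

# "Male"/"Female" cannot be parsed as numbers, so they become NA
gender_as_number <- suppressWarnings(as.numeric(m[, "Gender"]))

# With labels 0/1 the strings are "0"/"1", which do parse as numbers
relabeled <- factor(df$Gender, levels = c("Female", "Male"), labels = c(0, 1))
m2 <- as.matrix(data.frame(Gender = relabeled, Age = df$Age))
relabeled_as_number <- as.numeric(m2[, "Gender"])
```

Under that assumption, relabeling does not add any weight; it only makes the label strings parseable as numbers.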

After assigning labels

dataset <- read.csv("~/Desktop/Machine Learning /ML_16/Social_Network_Ads.csv")


dataset$Gender <- factor(dataset$Gender , levels = c("Female","Male") , labels = c(0 , 1))



library(caTools)

set.seed(1231)



sample_split <- sample.split(dataset$Gender , SplitRatio = 0.8)

training_dataset <- subset(dataset , sample_split == TRUE)

testing_dataset <- subset(dataset , sample_split == FALSE)


library(class)

model_classifier <- knn(train = training_dataset[,-5] , test = testing_dataset[,-5] , cl = training_dataset$Purchased , k = 21 )

library(caret)

confusionMatrix(table(model_classifier , testing_dataset$Purchased))
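
An alternative to relabeling the factor is to convert Gender to a numeric column explicitly, so that knn() receives only numeric input. The following is a sketch on a small made-up data frame; the toy values, the hand-picked training rows, and k = 3 are assumptions for illustration, not the real Social_Network_Ads.csv setup. Scaling is optional, but without it the Euclidean distance is dominated by EstimatedSalary.

```r
library(class)

# Toy data standing in for the real CSV (values are made up)
toy <- data.frame(
  Gender = factor(c("Male", "Male", "Female", "Female", "Male", "Female")),
  Age = c(19, 35, 26, 27, 45, 50),
  EstimatedSalary = c(19000, 20000, 43000, 57000, 90000, 95000),
  Purchased = c(0, 0, 0, 0, 1, 1)
)

# Explicit 0/1 encoding: Female = 0, Male = 1
toy$Gender <- as.numeric(toy$Gender == "Male")

# Put all predictors on a comparable scale before measuring distances
features <- scale(toy[, c("Gender", "Age", "EstimatedSalary")])

# Hand-picked split, only to keep the example deterministic
train_idx <- c(1, 2, 3, 5, 6)

pred <- knn(
  train = features[train_idx, , drop = FALSE],
  test = features[-train_idx, , drop = FALSE],
  cl = factor(toy$Purchased[train_idx]),
  k = 3
)
```

Note that an identifier column such as User.ID would normally be left out of the predictors as well, since its magnitude carries no meaning for a distance calculation.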
