For R

Time: 2018-03-07 15:34:28

Tags: r machine-learning

I have a factor field Gender in a data frame named dataset. As far as I know, a factor is similar to an enum in C, i.e. each name is mapped to a number.

> dataset$Gender <- as.factor(dataset$Gender)
> str(dataset$Gender)
 Factor w/ 2 levels "Female","Male": 2 2 1 1 2 2 1 1 2 1 ...
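
To make the enum analogy concrete, here is a minimal sketch of the two views a factor offers: the integer codes and the level labels. The vector `gender` is a hypothetical stand-in for dataset$Gender, not the real data.

```r
# Hypothetical toy vector standing in for dataset$Gender
gender <- factor(c("Male", "Male", "Female", "Female", "Male"))

levels(gender)        # the labels, sorted alphabetically by default
as.integer(gender)    # the underlying integer codes (Female = 1, Male = 2)
as.character(gender)  # the labels again, one per element
```

Which of these two views a function ends up using depends on how it converts its input, so the integer codes are not guaranteed to be what a numeric routine sees.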

Now, when I run K-Nearest Neighbor and pass this field as one of the independent variables, it throws an error.

However, if I supply labels for this factor field, everything works:

> dataset$Gender <- factor(dataset$Gender , levels = c("Female","Male") , labels = c(0,1))
> str(dataset$Gender)
 Factor w/ 2 levels "0","1": 2 2 1 1 2 2 1 1 2 1 ...

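What did the labels change? Do they give Female and Male numeric weights that help in computing the Euclidean distance? If so, why is the Euclidean distance not computed from the mapping the factor already carries (Female: 1, Male: 2) when no labels are supplied? Why does that mapping not work in the distance calculation?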
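
One way to probe this is to look at what happens when a data frame containing a factor is turned into a matrix. The sketch below assumes that knn() converts its train and test arguments with as.matrix() and then coerces the result to numbers; the data frame `toy` is hypothetical and only mimics the shape of the real data.

```r
# Hypothetical two-row frame with the same kind of columns as the real data
toy <- data.frame(
  Gender = factor(c("Male", "Female"), levels = c("Female", "Male")),
  Age    = c(19, 26)
)

m <- as.matrix(toy)   # a factor column turns the whole matrix into character
typeof(m)
m[, "Gender"]         # the labels "Male"/"Female", not the codes 2/1

# Coercing those labels to numbers yields NA (R also emits a coercion warning)
suppressWarnings(as.numeric(m[, "Gender"]))

# With the labels "0" and "1" the same coercion succeeds
toy$Gender <- factor(toy$Gender, levels = c("Female", "Male"), labels = c(0, 1))
m01 <- as.matrix(toy)
as.numeric(m01[, "Gender"])
```

If that assumption holds, the labels do not add weights; they only happen to be strings that parse as numbers, whereas the integer codes 1 and 2 are never consulted.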

Dataset

> head(dataset)
   User.ID Gender Age EstimatedSalary Purchased
1 15624510   Male  19           19000         0
2 15810944   Male  35           20000         0
3 15668575 Female  26           43000         0
4 15603246 Female  27           57000         0
5 15804002   Male  19           76000         0
6 15728773   Male  27           58000         0

Example that fails

dataset <- read.csv("~/Desktop/Machine Learning /ML_16/Social_Network_Ads.csv")

library(caTools)
set.seed(1231)

sample_split <- sample.split(dataset$Gender , SplitRatio = 0.8)
training_dataset <- subset(dataset , sample_split == TRUE)
testing_dataset <- subset(dataset , sample_split == FALSE)

library(class)
model_classifier <- knn(train = training_dataset[,-5] , test = testing_dataset[,-5] , cl = training_dataset$Purchased , k = 21 )

library(caret)
confusionMatrix(table(model_classifier , testing_dataset$Purchased))

Error

Error in knn(train = training_dataset[, -5], test = testing_dataset[,  : 
  NA/NaN/Inf in foreign function call (arg 6)
In addition: Warning messages:
1: In knn(train = training_dataset[, -5], test = testing_dataset[,  :
  NAs introduced by coercion
2: In knn(train = training_dataset[, -5], test = testing_dataset[,  :
  NAs introduced by coercion

After assigning labels

dataset <- read.csv("~/Desktop/Machine Learning /ML_16/Social_Network_Ads.csv")

dataset$Gender <- factor(dataset$Gender , levels = c("Female","Male") , labels = c(0 , 1))

library(caTools)
set.seed(1231)

sample_split <- sample.split(dataset$Gender , SplitRatio = 0.8)
training_dataset <- subset(dataset , sample_split == TRUE)
testing_dataset <- subset(dataset , sample_split == FALSE)

library(class)
model_classifier <- knn(train = training_dataset[,-5] , test = testing_dataset[,-5] , cl = training_dataset$Purchased , k = 21 )

library(caret)
confusionMatrix(table(model_classifier , testing_dataset$Purchased))
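
For comparison, a possible alternative to relabelling the factor is to recode it as a plain 0/1 number before calling knn(), so that every column passed in is numeric. This is only a sketch under stated assumptions: the data frames `train` and `test`, the vector `purchased`, and the helper `to_numeric` are all hypothetical, with made-up values.

```r
library(class)

# Hypothetical toy data; the values are made up for illustration only
train <- data.frame(
  Gender = factor(c("Male", "Female", "Male", "Female", "Male", "Female")),
  Age    = c(19, 26, 35, 27, 45, 50)
)
test <- data.frame(
  Gender = factor(c("Female", "Male"), levels = levels(train$Gender)),
  Age    = c(24, 48)
)
purchased <- factor(c(0, 0, 0, 0, 1, 1))

# Hypothetical helper: recode Gender explicitly as 0/1 so all columns are numeric
to_numeric <- function(df) {
  df$Gender <- as.integer(df$Gender == "Male")
  df
}

pred <- knn(train = to_numeric(train), test = to_numeric(test),
            cl = purchased, k = 3)
pred
```

The 0/1 coding is a choice made explicit here rather than something inferred from the level labels; because the distance is Euclidean, that choice affects how much Gender contributes relative to the other columns.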

0 answers:

No answers