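I keep using matrix.partion to partition my data, and I end up with values that are not in my training set but are in my test set. It keeps taking all occurrences of a value and putting them in the test set. Is there a simple way in code to prevent this?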
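Update. After switching from the matrix partition to the suggested code, I get the following (code included). I am now at a loss. If I use the partition, I seem to keep the levels, but I risk putting something in the test set that is not in the training set. When I try this approach instead, I get the error shown below. I am not sure where to go from here to fix it.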
> library(leaps)
> library(caret)
Loading required package: lattice
Loading required package: ggplot2
RStudio Community is a great place to get help: https://community.rstudio.com/c/tidyverse.
> library(dplyr)
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
> studentreport<-read.csv("C:\\Users\\Joseph\\Downloads\\studentreport dataset full imp.csv",header=T,sep=",")
> studentreport<-data.frame(studentreport)
>
> set.seed(123)
> smp_size = 7239
> training<- sample_n(studentreport,smp_size)
> testing<- setdiff(studentreport,training_data)
Error in setdiff_data_frame(x, y) : object 'training_data' not found
> testing<- setdiff(studentreport,training)
> str(training)
'data.frame': 7239 obs. of 13 variables:
$ Enrolling: logi FALSE TRUE TRUE FALSE FALSE FALSE ...
$ School : Factor w/ 2480 levels "A C Flora High School",..: 953 1191 1951 354 2159 32 677 8 870 1986 ...
$ State : Factor w/ 49 levels "AE","AL","AR",..: 40 40 28 34 38 40 39 40 31 40 ...
$ age : int 17 18 19 18 18 18 18 18 18 18 ...
$ Gender : Factor w/ 4 levels "Female","Male",..: 1 1 1 2 2 2 1 2 2 1 ...
$ Race : Factor w/ 7 levels "A","B","C","D",..: 1 1 1 7 6 4 7 1 1 1 ...
$ Major : Factor w/ 62 levels "Accounting","African American Studies",..: 10 11 23 60 38 50 20 55 1 60 ...
$ ACT : int 25 21 28 25 25 18 25 25 25 16 ...
$ SAT : num 1810 910 1625 1625 1790 ...
$ Rank : num 8 132 60 60 60 57 26 60 60 130 ...
$ CSize : int 329 397 337 337 337 270 131 337 337 430 ...
$ GPA : num 4.88 4.08 4.88 2.87 3.2 ...
$ GPAType : Factor w/ 3 levels "not known","Unweighted",..: 3 3 3 3 3 3 3 3 3 3 ...
> str(testing)
'data.frame': 2414 obs. of 13 variables:
$ Enrolling: logi TRUE FALSE FALSE FALSE FALSE FALSE ...
$ School : Factor w/ 2480 levels "A C Flora High School",..: 350 1962 281 2317 423 2013 518 1767 1614 1613 ...
$ State : Factor w/ 49 levels "AE","AL","AR",..: 44 34 20 20 20 20 23 31 5 9 ...
$ age : int 18 18 18 19 18 18 18 18 19 19 ...
$ Gender : Factor w/ 4 levels "Female","Male",..: 1 2 1 1 1 1 2 1 1 1 ...
$ Race : Factor w/ 7 levels "A","B","C","D",..: 7 1 1 7 7 1 6 7 1 7 ...
$ Major : Factor w/ 62 levels "Accounting","African American Studies",..: 23 10 19 24 10 60 11 60 14 20 ...
$ ACT : int 22 25 25 25 25 22 25 25 27 25 ...
$ SAT : num 1390 1540 1570 1430 1590 ...
$ Rank : num 60 60 60 60 60 60 60 60 60 60 ...
$ CSize : int 337 337 337 337 337 337 337 337 337 337 ...
$ GPA : num 3.8 3.22 3.4 3.39 3.4 ...
$ GPAType : Factor w/ 3 levels "not known","Unweighted",..: 3 2 3 3 3 2 3 3 2 3 ...
> fitreport<-glm(Enrolling~.,train,family="binomial")
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
> itstart=glm(Enrolling~1,data=training,family="binomial")
> Fitstart=glm(Enrolling~1,data=training,family="binomial")
>
> Report<-step(Fitstart,direction="forward",scope=formula(fitreport))
Start: AIC=7463.71
Enrolling ~ 1
Df Deviance AIC
+ State 48 7186.8 7284.8
+ ACT 1 7362.0 7366.0
+ Rank 1 7419.7 7423.7
+ GPA 1 7443.7 7447.7
+ CSize 1 7457.4 7461.4
+ GPAType 1 7457.9 7461.9
<none> 7461.7 7463.7
+ Gender 3 7455.8 7463.8
+ age 1 7460.1 7464.1
+ SAT 1 7460.2 7464.2
+ Race 6 7452.6 7466.6
+ Major 61 7363.5 7487.5
+ School 2150 5074.8 9376.8
Step: AIC=7284.83
Enrolling ~ State
Df Deviance AIC
+ Rank 1 7149.0 7249.0
+ ACT 1 7149.2 7249.2
+ GPA 1 7167.3 7267.3
+ CSize 1 7182.6 7282.6
+ age 1 7183.4 7283.4
<none> 7186.8 7284.8
+ SAT 1 7185.4 7285.4
+ Gender 3 7181.4 7285.4
+ GPAType 1 7186.4 7286.4
+ Race 6 7176.9 7286.9
+ Major 61 7089.7 7309.7
+ School 2141 5300.4 9680.4
Step: AIC=7248.99
Enrolling ~ State + Rank
Df Deviance AIC
+ ACT 1 7117.9 7219.9
+ GPA 1 7143.7 7245.7
+ CSize 1 7144.9 7246.9
+ age 1 7145.2 7247.2
<none> 7149.0 7249.0
+ SAT 1 7147.5 7249.5
+ GPAType 1 7148.5 7250.5
+ Gender 3 7145.1 7251.1
+ Race 6 7140.2 7252.2
+ Major 61 7058.0 7280.0
+ School 2142 5152.9 9536.9
Step: AIC=7219.89
Enrolling ~ State + Rank + ACT
Df Deviance AIC
+ age 1 7114.4 7218.4
<none> 7117.9 7219.9
+ CSize 1 7116.3 7220.3
+ SAT 1 7116.4 7220.4
+ GPA 1 7116.9 7220.9
+ Gender 3 7113.3 7221.3
+ GPAType 1 7117.3 7221.3
+ Race 6 7108.2 7222.2
+ Major 61 7022.6 7246.6
+ School 2141 6205.7 10589.7
Step: AIC=7218.37
Enrolling ~ State + Rank + ACT + age
Df Deviance AIC
<none> 7114.4 7218.4
+ CSize 1 7112.7 7218.7
+ SAT 1 7112.9 7218.9
+ GPA 1 7113.6 7219.6
+ GPAType 1 7113.8 7219.8
+ Gender 3 7110.2 7220.2
+ Race 6 7104.7 7220.7
+ Major 61 7019.2 7245.2
+ School 2142 8281.6 12669.6
Warning messages:
1: glm.fit: algorithm did not converge
2: glm.fit: fitted probabilities numerically 0 or 1 occurred
3: glm.fit: algorithm did not converge
4: glm.fit: fitted probabilities numerically 0 or 1 occurred
5: glm.fit: algorithm did not converge
6: glm.fit: fitted probabilities numerically 0 or 1 occurred
7: glm.fit: algorithm did not converge
8: glm.fit: fitted probabilities numerically 0 or 1 occurred
> Modelout<-predict(Report,newdata=testing,type="response")
> formula(Report)
Enrolling ~ State + Rank + ACT + age
> confusionMatrix(Modelout,testing$Enrolling,positive=1)
Error: `data` and `reference` should be factors with the same levels.
> confusionMatrix(Modelout,testing,positive=1)
Error: `data` and `reference` should be factors with the same levels.
> testresults<- ifelse(Modelout> 0.5,TRUE,FALSE)
> confusionMatrix(testresults,testing,positive=1)
Error: `data` and `reference` should be factors with the same levels.
> confusionMatrix(testresults,testing$Enrolling,positive=1)
Error: `data` and `reference` should be factors with the same levels.
> confusionMatrix(testresults,testing$Enrolling)
Answer 0 (score: 0)
There is a very simple way to split the data into training data and test data.
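library(dplyr)
data(iris)
smp_size = 100
# draw smp_size rows at random for training
training_data <- sample_n(iris,smp_size)
# everything not drawn becomes the test set
# (note: setdiff() on data frames also drops duplicated rows)
test_data <- setdiff(iris,training_data)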
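As a follow-up to the errors in the question, here is a minimal base-R sketch of three things that may help: splitting by row index instead of `setdiff()`, listing factor values that appear only in the test rows, and turning the predictions into a factor with the same levels as the reference. The likely cause of the "`data` and `reference` should be factors with the same levels" error is that a numeric or logical vector was passed where `confusionMatrix` expects factors. Everything here is an assumption for illustration: `iris` stands in for the student data, `is_virginica` is a made-up logical outcome playing the role of `Enrolling`, and `unseen_levels` is a hypothetical helper, not part of caret or dplyr.

```r
# Illustrative sketch only: iris stands in for the asker's data and
# is_virginica is a made-up logical outcome in place of Enrolling.
dat <- iris
dat$is_virginica <- dat$Species == "virginica"

set.seed(123)
# Split by row index so no rows are lost, even if some rows are duplicates.
train_idx <- sample(nrow(dat), size = 100)
training  <- dat[train_idx, ]
testing   <- dat[-train_idx, ]

# Hypothetical helper: for each factor column, list the values that occur
# in the test rows but never in the training rows.
unseen_levels <- function(train, test) {
  factor_cols <- names(train)[vapply(train, is.factor, logical(1))]
  out <- lapply(factor_cols, function(col) {
    setdiff(unique(as.character(test[[col]])),
            unique(as.character(train[[col]])))
  })
  names(out) <- factor_cols
  out[lengths(out) > 0]  # keep only columns that have a problem
}
problem_levels <- unseen_levels(training, testing)

# Fit a small logistic model and predict on the held-out rows.
fit  <- glm(is_virginica ~ Sepal.Length, data = training, family = binomial)
prob <- predict(fit, newdata = testing, type = "response")

# Build both vectors as factors with identical levels before comparing them.
pred <- factor(prob > 0.5, levels = c(FALSE, TRUE))
ref  <- factor(testing$is_virginica, levels = c(FALSE, TRUE))
conf_table <- table(predicted = pred, actual = ref)

# With caret loaded, the same two factors can be passed as
# confusionMatrix(pred, ref, positive = "TRUE")
```

If `problem_levels` is not empty for your own data (for example for `School`), the test rows carrying those values cannot be scored by a model that uses that column; one option is to move those rows into the training set, another is to leave such a high-cardinality column out of the model.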