Problems partitioning data into training and test sets in R

Asked: 2018-08-05 03:51:13

Tags: r partitioning sampling confusion-matrix

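I keep using matrix.partion to partition my data, and I end up with values in my test set that do not appear in my training set. It keeps taking all of those values and putting them in the test set. Is there a simple way in code to prevent this?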

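Update: after switching from the matrix partition to the suggested code, I get the following (code included). I am finally at a loss. If I use the partition, I seem to keep the levels, but I risk putting something in the test set that is not in the training set. When I try this approach, I get the message below. I am not sure where to go from here to fix it.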

> library(leaps)
> library(caret)
Loading required package: lattice
Loading required package: ggplot2
RStudio Community is a great place to get help: https://community.rstudio.com/c/tidyverse.
> library(dplyr)

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

> studentreport<-read.csv("C:\\Users\\Joseph\\Downloads\\studentreport dataset full imp.csv",header=T,sep=",")
> studentreport<-data.frame(studentreport)
> 
> set.seed(123)
> smp_size = 7239
> training<- sample_n(studentreport,smp_size)
> testing<- setdiff(studentreport,training_data)
Error in setdiff_data_frame(x, y) : object 'training_data' not found
> testing<- setdiff(studentreport,training)
> str(training)
'data.frame':   7239 obs. of  13 variables:
 $ Enrolling: logi  FALSE TRUE TRUE FALSE FALSE FALSE ...
 $ School   : Factor w/ 2480 levels "A C Flora High School",..: 953 1191 1951 354 2159 32 677 8 870 1986 ...
 $ State    : Factor w/ 49 levels "AE","AL","AR",..: 40 40 28 34 38 40 39 40 31 40 ...
 $ age      : int  17 18 19 18 18 18 18 18 18 18 ...
 $ Gender   : Factor w/ 4 levels "Female","Male",..: 1 1 1 2 2 2 1 2 2 1 ...
 $ Race     : Factor w/ 7 levels "A","B","C","D",..: 1 1 1 7 6 4 7 1 1 1 ...
 $ Major    : Factor w/ 62 levels "Accounting","African American Studies",..: 10 11 23 60 38 50 20 55 1 60 ...
 $ ACT      : int  25 21 28 25 25 18 25 25 25 16 ...
 $ SAT      : num  1810 910 1625 1625 1790 ...
 $ Rank     : num  8 132 60 60 60 57 26 60 60 130 ...
 $ CSize    : int  329 397 337 337 337 270 131 337 337 430 ...
 $ GPA      : num  4.88 4.08 4.88 2.87 3.2 ...
 $ GPAType  : Factor w/ 3 levels "not known","Unweighted",..: 3 3 3 3 3 3 3 3 3 3 ...
> str(testing)
'data.frame':   2414 obs. of  13 variables:
 $ Enrolling: logi  TRUE FALSE FALSE FALSE FALSE FALSE ...
 $ School   : Factor w/ 2480 levels "A C Flora High School",..: 350 1962 281 2317 423 2013 518 1767 1614 1613 ...
 $ State    : Factor w/ 49 levels "AE","AL","AR",..: 44 34 20 20 20 20 23 31 5 9 ...
 $ age      : int  18 18 18 19 18 18 18 18 19 19 ...
 $ Gender   : Factor w/ 4 levels "Female","Male",..: 1 2 1 1 1 1 2 1 1 1 ...
 $ Race     : Factor w/ 7 levels "A","B","C","D",..: 7 1 1 7 7 1 6 7 1 7 ...
 $ Major    : Factor w/ 62 levels "Accounting","African American Studies",..: 23 10 19 24 10 60 11 60 14 20 ...
 $ ACT      : int  22 25 25 25 25 22 25 25 27 25 ...
 $ SAT      : num  1390 1540 1570 1430 1590 ...
 $ Rank     : num  60 60 60 60 60 60 60 60 60 60 ...
 $ CSize    : int  337 337 337 337 337 337 337 337 337 337 ...
 $ GPA      : num  3.8 3.22 3.4 3.39 3.4 ...
 $ GPAType  : Factor w/ 3 levels "not known","Unweighted",..: 3 2 3 3 3 2 3 3 2 3 ...
> fitreport<-glm(Enrolling~.,train,family="binomial")
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred 
> itstart=glm(Enrolling~1,data=training,family="binomial")
> Fitstart=glm(Enrolling~1,data=training,family="binomial")
> 
> Report<-step(Fitstart,direction="forward",scope=formula(fitreport))
Start:  AIC=7463.71
Enrolling ~ 1

            Df Deviance    AIC
+ State     48   7186.8 7284.8
+ ACT        1   7362.0 7366.0
+ Rank       1   7419.7 7423.7
+ GPA        1   7443.7 7447.7
+ CSize      1   7457.4 7461.4
+ GPAType    1   7457.9 7461.9
<none>           7461.7 7463.7
+ Gender     3   7455.8 7463.8
+ age        1   7460.1 7464.1
+ SAT        1   7460.2 7464.2
+ Race       6   7452.6 7466.6
+ Major     61   7363.5 7487.5
+ School  2150   5074.8 9376.8

Step:  AIC=7284.83
Enrolling ~ State

            Df Deviance    AIC
+ Rank       1   7149.0 7249.0
+ ACT        1   7149.2 7249.2
+ GPA        1   7167.3 7267.3
+ CSize      1   7182.6 7282.6
+ age        1   7183.4 7283.4
<none>           7186.8 7284.8
+ SAT        1   7185.4 7285.4
+ Gender     3   7181.4 7285.4
+ GPAType    1   7186.4 7286.4
+ Race       6   7176.9 7286.9
+ Major     61   7089.7 7309.7
+ School  2141   5300.4 9680.4

Step:  AIC=7248.99
Enrolling ~ State + Rank

            Df Deviance    AIC
+ ACT        1   7117.9 7219.9
+ GPA        1   7143.7 7245.7
+ CSize      1   7144.9 7246.9
+ age        1   7145.2 7247.2
<none>           7149.0 7249.0
+ SAT        1   7147.5 7249.5
+ GPAType    1   7148.5 7250.5
+ Gender     3   7145.1 7251.1
+ Race       6   7140.2 7252.2
+ Major     61   7058.0 7280.0
+ School  2142   5152.9 9536.9

Step:  AIC=7219.89
Enrolling ~ State + Rank + ACT

            Df Deviance     AIC
+ age        1   7114.4  7218.4
<none>           7117.9  7219.9
+ CSize      1   7116.3  7220.3
+ SAT        1   7116.4  7220.4
+ GPA        1   7116.9  7220.9
+ Gender     3   7113.3  7221.3
+ GPAType    1   7117.3  7221.3
+ Race       6   7108.2  7222.2
+ Major     61   7022.6  7246.6
+ School  2141   6205.7 10589.7

Step:  AIC=7218.37
Enrolling ~ State + Rank + ACT + age

            Df Deviance     AIC
<none>           7114.4  7218.4
+ CSize      1   7112.7  7218.7
+ SAT        1   7112.9  7218.9
+ GPA        1   7113.6  7219.6
+ GPAType    1   7113.8  7219.8
+ Gender     3   7110.2  7220.2
+ Race       6   7104.7  7220.7
+ Major     61   7019.2  7245.2
+ School  2142   8281.6 12669.6
Warning messages:
1: glm.fit: algorithm did not converge 
2: glm.fit: fitted probabilities numerically 0 or 1 occurred 
3: glm.fit: algorithm did not converge 
4: glm.fit: fitted probabilities numerically 0 or 1 occurred 
5: glm.fit: algorithm did not converge 
6: glm.fit: fitted probabilities numerically 0 or 1 occurred 
7: glm.fit: algorithm did not converge 
8: glm.fit: fitted probabilities numerically 0 or 1 occurred 


> Modelout<-predict(Report,newdata=testing,type="response")
> formula(Report)
Enrolling ~ State + Rank + ACT + age
> confusionMatrix(Modelout,testing$Enrolling,positive=1)
Error: `data` and `reference` should be factors with the same levels.
> confusionMatrix(Modelout,testing,positive=1)
Error: `data` and `reference` should be factors with the same levels.
> testresults<- ifelse(Modelout> 0.5,TRUE,FALSE)
> confusionMatrix(testresults,testing,positive=1)
Error: `data` and `reference` should be factors with the same levels.
> confusionMatrix(testresults,testing$Enrolling,positive=1)
Error: `data` and `reference` should be factors with the same levels.
> confusionMatrix(testresults,testing$Enrolling)
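
The error in the transcript above says that both arguments must be factors with the same levels, whereas the calls pass numeric probabilities or logical vectors. A minimal sketch of one way around this, using made-up stand-in values (predicted_prob and actual are hypothetical names, standing in for Modelout and testing$Enrolling), might look like this:

```r
# Stand-in values; in the question these would be Modelout and testing$Enrolling.
predicted_prob <- c(0.91, 0.20, 0.65, 0.08, 0.47, 0.73)
actual         <- c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE)

# Threshold the probabilities, then turn both vectors into factors
# that share exactly the same levels.
lvls      <- c("FALSE", "TRUE")
predicted <- factor(predicted_prob > 0.5, levels = lvls)
reference <- factor(actual, levels = lvls)

# Base R cross-tabulation (rows: predicted, columns: actual).
conf_tab <- table(Predicted = predicted, Actual = reference)
print(conf_tab)

# With caret loaded, the same two factors can be passed on. Note that
# `positive` is a level name given as a character string, not the number 1.
# caret::confusionMatrix(predicted, reference, positive = "TRUE")
```

The 0.5 cutoff is taken from the transcript; whether it is the right threshold for this data is a separate question.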

1 Answer:

Answer 0 (score: 0)

There is a very simple way to split the data into training data and test data.

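library(dplyr)
data(iris)
set.seed(123)  # fix the seed so the split is reproducible
smp_size = 100
# draw smp_size rows at random for training
training_data <- sample_n(iris,smp_size)
# keep the remaining rows for testing; setdiff() compares whole rows,
# so any row identical to a sampled row is left out as well
test_data <- setdiff(iris,training_data)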
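
As an alternative sketch, the split can be done on row indices in base R, which keeps duplicated rows intact and makes it easy to check for the problem raised in the question: test rows whose factor level never occurs in the training rows. The data frame dat and its columns y and group are hypothetical stand-ins, and setting the unseen rows aside is only one possible choice:

```r
set.seed(123)  # make the random split reproducible

# Toy data frame standing in for the real data (hypothetical columns).
dat <- data.frame(
  y     = rep(c(TRUE, FALSE), times = 10),
  group = factor(c(rep(c("a", "b", "c"), times = 6), "d", "d"))
)

# Sample row indices rather than rows, so every row lands in exactly one set.
train_idx <- sample(seq_len(nrow(dat)), size = 15)
training  <- dat[train_idx, ]
testing   <- dat[-train_idx, ]

# Flag test rows whose factor level never appears in the training rows.
seen   <- unique(as.character(training$group))
unseen <- !(as.character(testing$group) %in% seen)

# One option (an assumption, not the only choice): set such rows aside
# before calling predict(), since the model has no coefficient for them.
testing_ok <- testing[!unseen, ]
```

Whether unseen rows should be dropped, moved into the training set, or handled by grouping rare levels together depends on the data; with a factor such as School, which has thousands of levels, some rare levels will likely end up in only one of the two sets.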