Column values change when creating a data partition

Time: 2015-11-26 16:14:16

Tags: r machine-learning r-caret

I have the following dataset:

head(filter_selection)
MATCHID COMPETITION            TEAM1              TEAM2 GOALS1 GOALS2 RESULT    EXPG1 EXPG2     DATUM     TIJD VERSCHIL
1 1696873  Pro League   Standard Liège Sporting Charleroi      3      0  TEAM1  1.57  0.61 25-7-2014 18:30:00     0.96
2 1696883  Pro League Waasland-Beveren        Club Brugge      0      2  TEAM2  1.29  1.18 26-7-2014 16:00:00     0.11
3 1696879  Pro League           Lierse        KV Oostende      2      0  TEAM1  1.03  1.04 26-7-2014 18:00:00    -0.01
4 1696881  Pro League         Westerlo            Lokeren      1      0  TEAM1  1.76  1.24 26-7-2014 18:00:00     0.52
5 1696877  Pro League         Mechelen               Genk      3      1  TEAM1  1.60  1.23 27-7-2014 12:30:00     0.37
6 1696871  Pro League       Anderlecht  Mouscron-Péruwelz      3      1  TEAM1  1.27  0.62 27-7-2014 16:00:00     0.65

I want to use the VERSCHIL value to predict RESULT. To do that, I create the training and test sets as follows:

library(rcaret)
inTrain <- createDataPartition(y=filter_selection$RESULT, p=0.75, list=FALSE)

However, when I do this, my RESULT column changes:

training <- df_final_test[inTrain, ]
testing <- df_final_test[-inTrain, ]
head(training, 20)

MATCHID   COMPETITION              TEAM1              TEAM2 GOALS1 GOALS2  RESULT EXPG1 EXPG2     DATUM     TIJD VERSCHIL CLAS type           TYPE  TYPE2
1  1696873    Pro League     Standard Liège Sporting Charleroi      3      0          3  1.57  0.61 25-7-2014 18:30:00     0.96 0.96  TBD (-0.0767,1.54]   HIGH
2  1696883    Pro League   Waasland-Beveren        Club Brugge      0      2         4  1.29  1.18 26-7-2014 16:00:00     0.11 0.11  TBD (-0.0767,1.54] MEDIUM

RESULT now shows 3 and 4 instead of TEAM1 and TEAM2. Can anyone tell me why the TEAM1 value became 3?

This is strange, because I do exactly the same thing with the spam dataset:

data(spam)
inTrain <- createDataPartition(y=spam$type, p=0.75, list=FALSE)
training <- spam[inTrain, ]
head(training)

and the two columns have the same class:

 class(spam$type)
 [1] "factor"
 class(filter_selection$RESULT)
 [1] "factor"

1 answer:

Answer 0: (score: 0)

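First, there is no package called rcaret; the package is caret. Second, you create the data partition on "filter_selection", but then build the training and test sets from a different data frame, "df_final_test".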
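The second point can be illustrated with a minimal base R sketch. The data frames `a` and `b` and the index vector `idx` are hypothetical stand-ins (for filter_selection, df_final_test, and inTrain respectively), not the asker's actual data. Note that the training output in the question has columns (CLAS, type, TYPE, TYPE2) that filter_selection does not, which is consistent with the rows coming from another data frame.

```r
# Hypothetical stand-ins: `a` plays the role of filter_selection,
# `b` the role of df_final_test. Same row count, different RESULT values.
a <- data.frame(id = 1:4,
                RESULT = factor(c("TEAM1", "TEAM2", "TEAM1", "TEAM2")))
b <- data.frame(id = 1:4,
                RESULT = factor(c("3", "4", "3", "4")))

# Row indices computed from `a` (in the question these come from
# createDataPartition on filter_selection$RESULT).
idx <- c(1, 2, 4)

# Subsetting `b` with those indices returns b's rows, so RESULT shows
# whatever `b` contains, not the TEAM1/TEAM2 labels stored in `a`.
picked <- b[idx, ]
print(picked)
```

Indices are just row positions, so they carry no information about which data frame they were computed from.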

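Also check the structure of df_final_test$RESULT and see how many levels the factor has. Something may have gone wrong there. If it has levels you do not want, use droplevels(df_final_test$RESULT).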
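A minimal sketch of that check, using base R only. The factor `result` is a hypothetical stand-in for df_final_test$RESULT, and the unused level "OTHER" is an assumption added purely for illustration.

```r
# Hypothetical stand-in for df_final_test$RESULT: a factor that declares
# a level ("OTHER") which never occurs in the data.
result <- factor(c("TEAM1", "TEAM2", "TEAM1", "DRAW"),
                 levels = c("DRAW", "TEAM1", "TEAM2", "OTHER"))

str(result)      # compact view: number of levels and the integer codes
levels(result)   # every declared level, whether used or not
table(result)    # counts per level; unused levels show a count of 0

# Drop the levels that never occur in the data.
cleaned <- droplevels(result)
levels(cleaned)
```

On a data frame column, the result has to be assigned back, for example `df_final_test$RESULT <- droplevels(df_final_test$RESULT)`, since droplevels returns a new factor rather than modifying the original.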

If I run your code on filter_selection and build the sets from that same data frame, I get correct training and test sets.

library(caret)
inTrain <- createDataPartition(y=filter_selection$RESULT, p=0.75, list=FALSE)

training <- filter_selection[inTrain, ]
testing <- filter_selection[-inTrain, ]
head(training)

  MATCHID COMPETITION            TEAM1              TEAM2 GOALS1 GOALS2 RESULT EXPG1 EXPG2     DATUM     TIJD VERSCHIL
1 1696873  Pro League   Standard Liège Sporting Charleroi      3      0  TEAM1  1.57  0.61 25-7-2014 18:30:00     0.96
2 1696883  Pro League Waasland-Beveren        Club Brugge      0      2  TEAM2  1.29  1.18 26-7-2014 16:00:00     0.11
4 1696881  Pro League         Westerlo            Lokeren      1      0  TEAM1  1.76  1.24 26-7-2014 18:00:00     0.52
5 1696877  Pro League         Mechelen               Genk      3      1  TEAM1  1.60  1.23 27-7-2014 12:30:00     0.37
6 1696871  Pro League       Anderlecht  Mouscron-Péruwelz      3      1  TEAM1  1.27  0.62 27-7-2014 16:00:00     0.65