I have the following dataset:
head(filter_selection)
MATCHID COMPETITION TEAM1 TEAM2 GOALS1 GOALS2 RESULT EXPG1 EXPG2 DATUM TIJD VERSCHIL
1 1696873 Pro League Standard Liège Sporting Charleroi 3 0 TEAM1 1.57 0.61 25-7-2014 18:30:00 0.96
2 1696883 Pro League Waasland-Beveren Club Brugge 0 2 TEAM2 1.29 1.18 26-7-2014 16:00:00 0.11
3 1696879 Pro League Lierse KV Oostende 2 0 TEAM1 1.03 1.04 26-7-2014 18:00:00 -0.01
4 1696881 Pro League Westerlo Lokeren 1 0 TEAM1 1.76 1.24 26-7-2014 18:00:00 0.52
5 1696877 Pro League Mechelen Genk 3 1 TEAM1 1.60 1.23 27-7-2014 12:30:00 0.37
6 1696871 Pro League Anderlecht Mouscron-Péruwelz 3 1 TEAM1 1.27 0.62 27-7-2014 16:00:00 0.65
I want to use the VERSCHIL value to predict RESULT. To do that, I create training and test sets as follows:
library(rcaret)
inTrain <- createDataPartition(y=filter_selection$RESULT, p=0.75, list=FALSE)
However, when I do this, my RESULT column changes:
training <- df_final_test[inTrain, ]
testing <- df_final_test[-inTrain, ]
head(training, 20)
MATCHID COMPETITION TEAM1 TEAM2 GOALS1 GOALS2 RESULT EXPG1 EXPG2 DATUM TIJD VERSCHIL CLAS type TYPE TYPE2
1 1696873 Pro League Standard Liège Sporting Charleroi 3 0 3 1.57 0.61 25-7-2014 18:30:00 0.96 0.96 TBD (-0.0767,1.54] HIGH
2 1696883 Pro League Waasland-Beveren Club Brugge 0 2 4 1.29 1.18 26-7-2014 16:00:00 0.11 0.11 TBD (-0.0767,1.54] MEDIUM
The values are now 3 and 4 instead of TEAM1 and TEAM2. Can anyone tell me why the TEAM1 value turned into 3?
What is strange is that this does not happen when I do the same thing with the spam dataset:
data(spam)
inTrain <- createDataPartition(y=spam$type, p=0.75, list=FALSE)
training <- spam[inTrain, ]
head(training)
And that is even though both columns have the same class:
class(spam$type)
[1] "factor"
class(filter_selection$RESULT)
[1] "factor"
Answer 0 (score: 0)
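First, there is no package called rcaret; the package is caret. Second, you create the data partition on filter_selection, but then build the training and test sets from a different data frame, df_final_test.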
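A minimal sketch of why mixing data frames is risky, using base R only. The data frames df_a and df_b and the index vector idx are hypothetical stand-ins (for filter_selection, df_final_test, and inTrain respectively), not the asker's actual data. Row positions computed on one data frame only mean the same thing on another if both have the same rows in the same order:

```r
# Hypothetical toy data: df_a plays the role of filter_selection,
# df_b the role of df_final_test (same rows, but in a different order).
df_a <- data.frame(
  id     = 1:6,
  result = factor(c("TEAM1", "TEAM2", "TEAM1", "TEAM1", "TEAM1", "TEAM1"))
)
df_b <- df_a[c(6, 5, 4, 3, 2, 1), ]

# Row positions chosen on df_a (standing in for the inTrain indices).
idx <- c(1, 2, 4)

train_a <- df_a[idx, ]  # the rows that were actually selected
train_b <- df_b[idx, ]  # same positions, but different matches

train_a$id  # 1 2 4
train_b$id  # 6 5 3
```

Subsetting the same data frame that was passed to createDataPartition avoids this mismatch.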
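Also check the structure of df_final_test$RESULT (for example with str()) to see how many levels the factor has. Maybe something went wrong there. If it contains levels you do not want, use droplevels(df_final_test$RESULT).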
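The following sketch shows how such a check might look on a toy factor. The level names "DRAW" and "NONE" are invented for illustration, and whether df_final_test$RESULT really carries extra levels is an assumption, not something the question confirms. It does show one way that TEAM1 and TEAM2 can surface as 3 and 4: those are the integer codes of the third and fourth levels.

```r
# Hypothetical factor with two leftover levels that no observation uses.
result <- factor(
  c("TEAM1", "TEAM2", "TEAM1"),
  levels = c("DRAW", "NONE", "TEAM1", "TEAM2")
)

str(result)         # shows the number of levels and the underlying codes
levels(result)      # "DRAW" "NONE" "TEAM1" "TEAM2"
as.integer(result)  # 3 4 3: the codes behind TEAM1 and TEAM2

# Remove the levels that do not occur in the data.
cleaned <- droplevels(result)
levels(cleaned)      # "TEAM1" "TEAM2"
as.integer(cleaned)  # 1 2 1
```

If the column displays as numbers, converting it with as.character() before any numeric coercion keeps the labels rather than the codes.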
If I run your code on filter_selection and build the training set from that same data frame, I get correct training and test sets.
library(caret)
inTrain <- createDataPartition(y=filter_selection$RESULT, p=0.75, list=FALSE)
training <- filter_selection[inTrain, ]
testing <- filter_selection[-inTrain, ]
head(training)
MATCHID COMPETITION TEAM1 TEAM2 GOALS1 GOALS2 RESULT EXPG1 EXPG2 DATUM TIJD VERSCHIL
1 1696873 Pro League Standard Liège Sporting Charleroi 3 0 TEAM1 1.57 0.61 25-7-2014 18:30:00 0.96
2 1696883 Pro League Waasland-Beveren Club Brugge 0 2 TEAM2 1.29 1.18 26-7-2014 16:00:00 0.11
4 1696881 Pro League Westerlo Lokeren 1 0 TEAM1 1.76 1.24 26-7-2014 18:00:00 0.52
5 1696877 Pro League Mechelen Genk 3 1 TEAM1 1.60 1.23 27-7-2014 12:30:00 0.37
6 1696871 Pro League Anderlecht Mouscron-Péruwelz 3 1 TEAM1 1.27 0.62 27-7-2014 16:00:00 0.65