Question

I am trying to predict the outcome class (won) of some football matches. I have the following df:

head(df)
won               TEAM1              TEAM2  EXPG1   EXPG2 VAR1  VAR2
41   1            Bordeaux             Bastia 1.4200 0.93285  0.33  0.32
42   0                Caen             Rennes 1.3105 1.11580  0.44  0.43
43   2                Lens              Reims 1.3678 0.90433  0.34  0.43
44   2             Lorient           Guingamp 1.1773 0.96671  0.43  0.54
45   2 Olympique Marseille               Nice 1.3541 0.89154  0.53  0.54
46   2                Metz Olympique Lyonnais 1.1768 0.99026  0.53  0.61

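Now I want to test the following:

Whether my predictions improve if I include another variable besides EXPG1 and EXPG2, and, if so, whether I should include VAR1 or VAR2 (i.e. which variable gives the better accuracy).

I did the following: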

library(caret)
library(randomForest)

# Assumption: won should be a factor so that randomForest does classification
df$won <- factor(df$won)

inTrain <- createDataPartition(y=df$won, p=0.7, list=FALSE)
training <- df[inTrain, ]
testing <- df[-inTrain, ]

#perform rf for EXPG1 and EXPG2
fit1 <- randomForest(won ~ EXPG1 + EXPG2, data=training, importance=TRUE, ntree=2000)
Prediction1 <- predict(fit1, testing)
conf1 <- confusionMatrix(Prediction1, testing$won)
#get accuracy
conf1$overall["Accuracy"]

#perform rf for EXPG1, EXPG2 and VAR1
fit1 <- randomForest(won ~ EXPG1 + EXPG2 + VAR1, data=training, importance=TRUE, ntree=2000)
Prediction1 <- predict(fit1, testing)
conf1 <- confusionMatrix(Prediction1, testing$won)
#get accuracy
conf1$overall["Accuracy"]

#perform rf for EXPG1, EXPG2 and VAR2
fit1 <- randomForest(won ~ EXPG1 + EXPG2 + VAR2, data=training, importance=TRUE, ntree=2000)
Prediction1 <- predict(fit1, testing)
conf1 <- confusionMatrix(Prediction1, testing$won)
#get accuracy
conf1$overall["Accuracy"]

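The goal is to see whether VAR1 / VAR2 have an effect and which of the two gives the larger improvement.

I am not sure this is the right approach, because when I run the following statement twice (without setting a seed):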

#perform rf for EXPG1 and EXPG2
fit1 <- randomForest(won ~ EXPG1 + EXPG2, data=training, importance=TRUE, ntree=2000)
Prediction1 <- predict(fit1, testing)
conf1 <- confusionMatrix(Prediction1, testing$won)
#get accuracy
conf1$overall["Accuracy"]

I get two different accuracy values (41% and 43%).

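So if I find that adding VAR1 increases accuracy by 1%, that could simply be due to chance (running the algorithm twice with only EXPG1 and EXPG2 can already give results that differ by 2%).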
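One way to put a number on that chance variation is to refit the baseline model on the same train/test split under several seeds and look at the spread of the accuracies. The sketch below is only an illustration: `df_demo` is a hypothetical, randomly generated stand-in for the real df, and the 20 seeds and `ntree = 500` are arbitrary choices. Note that it measures only the randomness of the forest itself, not the randomness of the split.

```r
library(randomForest)

# Hypothetical stand-in for the real df, only so the sketch runs on its own
set.seed(1)
n <- 300
df_demo <- data.frame(
  won   = factor(sample(0:2, n, replace = TRUE)),
  EXPG1 = runif(n, 1.0, 1.6),
  EXPG2 = runif(n, 0.8, 1.2)
)

# One fixed 70/30 split, reused for every refit
train_idx <- sample(n, size = round(0.7 * n))
training  <- df_demo[train_idx, ]
testing   <- df_demo[-train_idx, ]

# Refit the same model under different seeds and record test accuracy
acc <- sapply(1:20, function(seed) {
  set.seed(seed)
  fit <- randomForest(won ~ EXPG1 + EXPG2, data = training, ntree = 500)
  mean(predict(fit, testing) == testing$won)
})

# Spread of accuracy caused by the forest's own randomness
summary(acc)
sd(acc)
```

If the gain from adding VAR1 or VAR2 is not clearly larger than this spread, a single run cannot tell it apart from noise.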

Any ideas on how to set up a reliable test of the added value of VAR1 and VAR2?

Use cross-validation to reduce the variance of the prediction results.
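A minimal sketch of what that suggestion could look like with caret, assuming repeated k-fold cross-validation in which all three models are evaluated on the same folds so that their accuracies can be compared pairwise. `df_demo` is again a hypothetical, randomly generated stand-in for the real df, and the 5 folds, 3 repeats, `ntree = 200` and fixed `mtry = 1` are arbitrary choices made to keep the example fast.

```r
library(caret)
library(randomForest)

# Hypothetical stand-in for the real df, only so the sketch runs on its own
set.seed(1)
n <- 300
df_demo <- data.frame(
  won   = factor(sample(0:2, n, replace = TRUE)),
  EXPG1 = runif(n, 1.0, 1.6),
  EXPG2 = runif(n, 0.8, 1.2),
  VAR1  = runif(n, 0.3, 0.6),
  VAR2  = runif(n, 0.3, 0.6)
)

# Identical resampling indices for every model, so the results are paired
folds <- createMultiFolds(df_demo$won, k = 5, times = 3)
ctrl  <- trainControl(method = "repeatedcv", number = 5, repeats = 3,
                      index = folds)

# Helper (hypothetical name) that fits one formula under the shared folds
fit_model <- function(form) {
  set.seed(42)
  train(form, data = df_demo, method = "rf", trControl = ctrl,
        tuneGrid = data.frame(mtry = 1), ntree = 200)
}

fits <- list(
  base = fit_model(won ~ EXPG1 + EXPG2),
  var1 = fit_model(won ~ EXPG1 + EXPG2 + VAR1),
  var2 = fit_model(won ~ EXPG1 + EXPG2 + VAR2)
)

res <- resamples(fits)
summary(res)        # accuracy distribution over the 15 resamples per model
summary(diff(res))  # paired differences between the models
```

Comparing the distribution of accuracies over many resamples, rather than two single numbers, makes it easier to judge whether a 1% difference is larger than the fold-to-fold noise.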

0 answers: