I am working on the Coursera machine learning project. The goal is to build a predictive model for the following dataset.
> summary(training)
roll_belt pitch_belt yaw_belt total_accel_belt gyros_belt_x
Min. :-28.90 Min. :-55.8000 Min. :-180.00 Min. : 0.00 Min. :-1.040000
1st Qu.: 1.10 1st Qu.: 1.7600 1st Qu.: -88.30 1st Qu.: 3.00 1st Qu.:-0.030000
Median :113.00 Median : 5.2800 Median : -13.00 Median :17.00 Median : 0.030000
Mean : 64.41 Mean : 0.3053 Mean : -11.21 Mean :11.31 Mean :-0.005592
3rd Qu.:123.00 3rd Qu.: 14.9000 3rd Qu.: 12.90 3rd Qu.:18.00 3rd Qu.: 0.110000
Max. :162.00 Max. : 60.3000 Max. : 179.00 Max. :29.00 Max. : 2.220000
gyros_belt_y gyros_belt_z accel_belt_x accel_belt_y accel_belt_z magnet_belt_x
Min. :-0.64000 Min. :-1.4600 Min. :-120.000 Min. :-69.00 Min. :-275.00 Min. :-52.0
1st Qu.: 0.00000 1st Qu.:-0.2000 1st Qu.: -21.000 1st Qu.: 3.00 1st Qu.:-162.00 1st Qu.: 9.0
Median : 0.02000 Median :-0.1000 Median : -15.000 Median : 35.00 Median :-152.00 Median : 35.0
Mean : 0.03959 Mean :-0.1305 Mean : -5.595 Mean : 30.15 Mean : -72.59 Mean : 55.6
3rd Qu.: 0.11000 3rd Qu.:-0.0200 3rd Qu.: -5.000 3rd Qu.: 61.00 3rd Qu.: 27.00 3rd Qu.: 59.0
Max. : 0.64000 Max. : 1.6200 Max. : 85.000 Max. :164.00 Max. : 105.00 Max. :485.0
magnet_belt_y magnet_belt_z roll_arm pitch_arm yaw_arm total_accel_arm
Min. :354.0 Min. :-623.0 Min. :-180.00 Min. :-88.800 Min. :-180.0000 Min. : 1.00
1st Qu.:581.0 1st Qu.:-375.0 1st Qu.: -31.77 1st Qu.:-25.900 1st Qu.: -43.1000 1st Qu.:17.00
Median :601.0 Median :-320.0 Median : 0.00 Median : 0.000 Median : 0.0000 Median :27.00
Mean :593.7 Mean :-345.5 Mean : 17.83 Mean : -4.612 Mean : -0.6188 Mean :25.51
3rd Qu.:610.0 3rd Qu.:-306.0 3rd Qu.: 77.30 3rd Qu.: 11.200 3rd Qu.: 45.8750 3rd Qu.:33.00
Max. :673.0 Max. : 293.0 Max. : 180.00 Max. : 88.500 Max. : 180.0000 Max. :66.00
gyros_arm_x gyros_arm_y gyros_arm_z accel_arm_x accel_arm_y
Min. :-6.37000 Min. :-3.4400 Min. :-2.3300 Min. :-404.00 Min. :-318.0
1st Qu.:-1.33000 1st Qu.:-0.8000 1st Qu.:-0.0700 1st Qu.:-242.00 1st Qu.: -54.0
Median : 0.08000 Median :-0.2400 Median : 0.2300 Median : -44.00 Median : 14.0
Mean : 0.04277 Mean :-0.2571 Mean : 0.2695 Mean : -60.24 Mean : 32.6
3rd Qu.: 1.57000 3rd Qu.: 0.1400 3rd Qu.: 0.7200 3rd Qu.: 84.00 3rd Qu.: 139.0
Max. : 4.87000 Max. : 2.8400 Max. : 3.0200 Max. : 437.00 Max. : 308.0
accel_arm_z magnet_arm_x magnet_arm_y magnet_arm_z roll_dumbbell pitch_dumbbell
Min. :-636.00 Min. :-584.0 Min. :-392.0 Min. :-597.0 Min. :-153.71 Min. :-149.59
1st Qu.:-143.00 1st Qu.:-300.0 1st Qu.: -9.0 1st Qu.: 131.2 1st Qu.: -18.49 1st Qu.: -40.89
Median : -47.00 Median : 289.0 Median : 202.0 Median : 444.0 Median : 48.17 Median : -20.96
Mean : -71.25 Mean : 191.7 Mean : 156.6 Mean : 306.5 Mean : 23.84 Mean : -10.78
3rd Qu.: 23.00 3rd Qu.: 637.0 3rd Qu.: 323.0 3rd Qu.: 545.0 3rd Qu.: 67.61 3rd Qu.: 17.50
Max. : 292.00 Max. : 782.0 Max. : 583.0 Max. : 694.0 Max. : 153.55 Max. : 149.40
yaw_dumbbell total_accel_dumbbell gyros_dumbbell_x gyros_dumbbell_y gyros_dumbbell_z
Min. :-150.871 Min. : 0.00 Min. :-204.0000 Min. :-2.10000 Min. : -2.380
1st Qu.: -77.644 1st Qu.: 4.00 1st Qu.: -0.0300 1st Qu.:-0.14000 1st Qu.: -0.310
Median : -3.324 Median :10.00 Median : 0.1300 Median : 0.03000 Median : -0.130
Mean : 1.674 Mean :13.72 Mean : 0.1611 Mean : 0.04606 Mean : -0.129
3rd Qu.: 79.643 3rd Qu.:19.00 3rd Qu.: 0.3500 3rd Qu.: 0.21000 3rd Qu.: 0.030
Max. : 154.952 Max. :58.00 Max. : 2.2200 Max. :52.00000 Max. :317.000
accel_dumbbell_x accel_dumbbell_y accel_dumbbell_z magnet_dumbbell_x magnet_dumbbell_y
Min. :-419.00 Min. :-189.00 Min. :-334.00 Min. :-643.0 Min. :-3600
1st Qu.: -50.00 1st Qu.: -8.00 1st Qu.:-142.00 1st Qu.:-535.0 1st Qu.: 231
Median : -8.00 Median : 41.50 Median : -1.00 Median :-479.0 Median : 311
Mean : -28.62 Mean : 52.63 Mean : -38.32 Mean :-328.5 Mean : 221
3rd Qu.: 11.00 3rd Qu.: 111.00 3rd Qu.: 38.00 3rd Qu.:-304.0 3rd Qu.: 390
Max. : 235.00 Max. : 315.00 Max. : 318.00 Max. : 592.0 Max. : 633
magnet_dumbbell_z roll_forearm pitch_forearm yaw_forearm total_accel_forearm
Min. :-262.00 Min. :-180.0000 Min. :-72.50 Min. :-180.00 Min. : 0.00
1st Qu.: -45.00 1st Qu.: -0.7375 1st Qu.: 0.00 1st Qu.: -68.60 1st Qu.: 29.00
Median : 13.00 Median : 21.7000 Median : 9.24 Median : 0.00 Median : 36.00
Mean : 46.05 Mean : 33.8265 Mean : 10.71 Mean : 19.21 Mean : 34.72
3rd Qu.: 95.00 3rd Qu.: 140.0000 3rd Qu.: 28.40 3rd Qu.: 110.00 3rd Qu.: 41.00
Max. : 452.00 Max. : 180.0000 Max. : 89.80 Max. : 180.00 Max. :108.00
gyros_forearm_x gyros_forearm_y gyros_forearm_z accel_forearm_x accel_forearm_y
Min. :-22.000 Min. : -7.02000 Min. : -8.0900 Min. :-498.00 Min. :-632.0
1st Qu.: -0.220 1st Qu.: -1.46000 1st Qu.: -0.1800 1st Qu.:-178.00 1st Qu.: 57.0
Median : 0.050 Median : 0.03000 Median : 0.0800 Median : -57.00 Median : 201.0
Mean : 0.158 Mean : 0.07517 Mean : 0.1512 Mean : -61.65 Mean : 163.7
3rd Qu.: 0.560 3rd Qu.: 1.62000 3rd Qu.: 0.4900 3rd Qu.: 76.00 3rd Qu.: 312.0
Max. : 3.970 Max. :311.00000 Max. :231.0000 Max. : 477.00 Max. : 923.0
accel_forearm_z magnet_forearm_x magnet_forearm_y magnet_forearm_z classe
Min. :-446.00 Min. :-1280.0 Min. :-896.0 Min. :-973.0 A:5580
1st Qu.:-182.00 1st Qu.: -616.0 1st Qu.: 2.0 1st Qu.: 191.0 B:3797
Median : -39.00 Median : -378.0 Median : 591.0 Median : 511.0 C:3422
Mean : -55.29 Mean : -312.6 Mean : 380.1 Mean : 393.6 D:3216
3rd Qu.: 26.00 3rd Qu.: -73.0 3rd Qu.: 737.0 3rd Qu.: 653.0 E:3607
Max. : 291.00 Max. : 672.0 Max. :1480.0 Max. :1090.0
To train the model, I did the following:
trainCtrl <- trainControl(method = "cv", number = 10, savePredictions = TRUE)
rfModel <- train(classe ~., method = "rf", trControl = trainCtrl, preProcess = "pca", data = training, prox = TRUE)
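The model works. However, I am annoyed by a warning that is repeated up to 20 times: invalid mtry: reset to within valid range. Some Googling did not turn up anything useful. Also, in case it matters, the dataset contains no NA values; they were removed in an earlier step.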
I also ran system.time(), and the processing took more than an hour.
> system.time(train(classe ~., method = "rf", trControl = trainCtrl, preProcess = "pca", data = training, prox = TRUE))
user system elapsed
6478.113 302.281 7044.483
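It would be great if you could help decipher what this warning means and why it occurs. I would also like to hear any comments on the long processing time.
Thanks!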
Answer 0 (score: 4)
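The caret rf method uses the randomForest function from the randomForest package. If you set the mtry argument of randomForest to a value greater than the number of predictors, you get the warning you posted (for example, try rf = randomForest(mpg ~ ., mtry=15, data=mtcars)). The model still runs, but randomForest resets mtry to a lower, valid value.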
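To see the reset for yourself, here is a minimal sketch built on the same mtcars example. The names msgs and n_predictors are only illustrative, and it assumes the fitted object exposes the mtry it ended up using as rf$mtry.

```r
library(randomForest)

set.seed(1)
n_predictors <- ncol(mtcars) - 1  # mpg is the response, the rest are predictors

# Collect warning messages instead of letting them print
msgs <- character(0)
rf <- withCallingHandlers(
  randomForest(mpg ~ ., data = mtcars, mtry = 15),  # mtry deliberately too large
  warning = function(w) {
    msgs <<- c(msgs, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

msgs          # warning text raised during the fit
n_predictors  # upper bound for a valid mtry
rf$mtry       # value stored on the fitted model after the reset
```

Comparing rf$mtry with n_predictors shows what the requested value of 15 was replaced with.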
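The question is, why does train (or one of the functions it calls) pass randomForest an mtry value that is too large? I'm not sure, but here is a guess: setting preProcess="pca" reduces the number of features fed to randomForest (relative to the number in the raw data), because the least important principal components are dropped to reduce the dimensionality of the feature set. However, when doing cross-validation, train may set the maximum mtry value for randomForest based on the larger number of features in the raw data rather than on the preprocessed data that is actually fed to randomForest. Circumstantial evidence for this is that the warning goes away if you remove the preProcess="pca" argument, but I haven't checked any further.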
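One way to probe this guess is to compare the number of raw predictors with the number of principal components that caret keeps. The sketch below does this on mtcars, outside of train, with caret's default PCA settings. The variable names are illustrative, and inside train the pre-processing is re-estimated per resample, so the component count could differ from fold to fold.

```r
library(caret)

# Predictors only; mpg is the response
predictors <- mtcars[, names(mtcars) != "mpg"]

# PCA pre-processing with caret's default variance threshold
pp <- preProcess(predictors, method = "pca")

ncol(predictors)  # number of predictors in the raw data
pp$numComp        # number of principal components retained
```

If pp$numComp is smaller than ncol(predictors), any mtry candidate derived from the raw column count could exceed what randomForest actually receives, which would be consistent with the warning.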
Reproducible code showing that the warning goes away without pca:
trainCtrl <- trainControl(method = "cv", number = 10, savePredictions = TRUE)
rfModel <- train(mpg ~., method = "rf", trControl = trainCtrl, preProcess = "pca", data = mtcars, prox = TRUE)
rfModel <- train(mpg ~., method = "rf", trControl = trainCtrl, data = mtcars, prox = TRUE)
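If the guess above is right, one possible workaround (untested on your data) is to fix the number of PCA components and give train an explicit mtry grid that cannot exceed it. In this sketch n_comp = 3 is an arbitrary, hypothetical choice, and it assumes that pcaComp passed through preProcOptions in trainControl reaches preProcess.

```r
library(caret)

set.seed(1)
n_comp <- 3  # hypothetical choice: keep exactly 3 principal components

# Fix the component count so the number of predictors seen by randomForest is known
trainCtrl <- trainControl(method = "cv", number = 10,
                          preProcOptions = list(pcaComp = n_comp))

# Only try mtry values that cannot exceed the number of components
grid <- data.frame(mtry = seq_len(n_comp))

rfModel <- train(mpg ~ ., data = mtcars, method = "rf",
                 trControl = trainCtrl, preProcess = "pca",
                 tuneGrid = grid)

rfModel$results[, c("mtry", "RMSE")]  # one row per mtry value tried
rfModel$bestTune                      # mtry selected by cross-validation
```

With the grid capped at n_comp, train never asks randomForest for more variables per split than there are components, so the reset should not be triggered.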