我正在使用mboost
包进行一些分类。这是代码
library('mboost')
load('so-data.rdata')
model <- glmboost(is_exciting~., data=training, family=Binomial())
pred <- predict(model, newdata=test, type="response")
但R在做预测时抱怨
Error in scale.default(X, center = cm, scale = FALSE) :
length of 'center' must equal the number of columns of 'x'
可以在此处下载数据(training
和test
)(7z,zip)。
错误的原因是什么以及如何摆脱它?谢谢。
更新:
> str(training)
'data.frame': 439599 obs. of 24 variables:
$ is_exciting : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ school_state : Factor w/ 52 levels "AK","AL","AR",..: 15 5 5 23 47 5 44 42 42 5 ...
$ school_charter : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ school_magnet : Factor w/ 2 levels "f","t": 1 1 1 1 2 1 1 1 1 1 ...
$ school_year_round : Factor w/ 2 levels "f","t": 1 1 1 1 1 2 1 1 1 2 ...
$ school_nlns : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ school_charter_ready_promise : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ teacher_prefix : Factor w/ 6 levels "","Dr.","Mr.",..: 5 5 3 5 6 5 6 6 5 6 ...
$ teacher_teach_for_america : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 2 1 2 1 ...
$ teacher_ny_teaching_fellow : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ primary_focus_subject : Factor w/ 28 levels "","Applied Sciences",..: 19 17 18 18 10 4 17 17 18 17 ...
$ primary_focus_area : Factor w/ 8 levels "","Applied Learning",..: 6 5 5 5 5 4 5 5 5 5 ...
$ secondary_focus_subject : Factor w/ 28 levels "","Applied Sciences",..: 28 18 17 19 26 18 18 28 24 25 ...
$ secondary_focus_area : Factor w/ 8 levels "","Applied Learning",..: 7 5 5 6 8 5 5 7 7 4 ...
$ resource_type : Factor w/ 7 levels "","Books","Other",..: 4 4 2 5 5 2 2 5 5 5 ...
$ poverty_level : Factor w/ 4 levels "high poverty",..: 2 2 4 2 1 2 2 1 2 1 ...
$ grade_level : Factor w/ 5 levels "","Grades 3-5",..: 5 5 2 5 5 2 3 2 4 2 ...
$ fulfillment_labor_materials : num 30 35 35 30 30 35 30 35 35 35 ...
$ total_price_excluding_optional_support: num 1274 477 892 548 385 ...
$ total_price_including_optional_support: num 1499 562 1050 645 453 ...
$ students_reached : int 31 20 250 36 19 28 90 21 60 56 ...
$ eligible_double_your_impact_match : Factor w/ 2 levels "f","t": 1 2 1 2 1 2 1 1 1 1 ...
$ eligible_almost_home_match : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 2 2 1 1 ...
$ essay_length : int 236 285 194 351 383 273 385 437 476 159 ...
> str(test)
'data.frame': 44772 obs. of 23 variables:
$ school_state : Factor w/ 51 levels "AK","AL","AR",..: 22 35 11 46 5 35 11 28 28 10 ...
$ school_charter : Factor w/ 2 levels "f","t": 1 1 1 1 2 1 1 1 1 1 ...
$ school_magnet : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ school_year_round : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ school_nlns : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ school_charter_ready_promise : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ teacher_prefix : Factor w/ 6 levels "","Dr.","Mr.",..: 3 5 6 6 3 5 5 5 3 5 ...
$ teacher_teach_for_america : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ teacher_ny_teaching_fellow : Factor w/ 2 levels "f","t": 1 2 1 1 1 1 1 1 1 1 ...
$ primary_focus_subject : Factor w/ 28 levels "","Applied Sciences",..: 5 16 17 17 18 11 16 17 2 17 ...
$ primary_focus_area : Factor w/ 8 levels "","Applied Learning",..: 2 4 5 5 5 2 4 5 6 5 ...
$ secondary_focus_subject : Factor w/ 28 levels "","Applied Sciences",..: 25 1 19 1 17 9 17 11 1 1 ...
$ secondary_focus_area : Factor w/ 8 levels "","Applied Learning",..: 4 1 6 1 5 6 5 2 1 1 ...
$ resource_type : Factor w/ 7 levels "","Books","Other",..: 5 5 5 2 5 6 4 5 5 4 ...
$ poverty_level : Factor w/ 4 levels "high poverty",..: 1 2 4 4 1 2 2 2 1 2 ...
$ grade_level : Factor w/ 5 levels "","Grades 3-5",..: 4 3 3 5 4 5 5 4 3 5 ...
$ fulfillment_labor_materials : num 30 30 30 30 30 30 30 30 30 30 ...
$ total_price_excluding_optional_support: num 2185 149 1017 156 860 ...
$ total_price_including_optional_support: num 2571 175 1197 183 1012 ...
$ students_reached : int 200 110 10 22 180 51 30 15 260 20 ...
$ eligible_double_your_impact_match : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ eligible_almost_home_match : Factor w/ 2 levels "f","t": 2 1 1 1 1 1 1 1 2 1 ...
$ essay_length : int 221 137 313 243 373 344 304 431 231 173 ...
> summary(model)
Generalized Linear Models Fitted via Gradient Boosting
Call:
glmboost.formula(formula = is_exciting ~ ., data = training, family = Binomial())
Negative Binomial Likelihood
Loss function: {
f <- pmin(abs(f), 36) * sign(f)
p <- exp(f)/(exp(f) + exp(-f))
y <- (y + 1)/2
-y * log(p) - (1 - y) * log(1 - p)
}
Number of boosting iterations: mstop = 100
Step size: 0.1
Offset: -1.197806
Coefficients:
NOTE: Coefficients from a Binomial model are half the size of coefficients
from a model fitted via glm(... , family = 'binomial').
See Warning section in ?coef.mboost
(Intercept) school_stateDC
-0.5250166130 0.0426909965
school_stateIL school_chartert
0.0084191638 0.0729272310
teacher_prefixMrs. teacher_prefixMs.
-0.0181489492 0.0438425925
teacher_teach_for_americat resource_typeBooks
0.2593005345 0.0046126706
resource_typeTechnology fulfillment_labor_materials
-0.0313904871 0.0120086140
eligible_double_your_impact_matcht eligible_almost_home_matcht
-0.0316376431 -0.0522717398
essay_length
0.0004993224
attr(,"offset")
[1] -1.197806
Selection frequencies:
fulfillment_labor_materials teacher_teach_for_americat
0.24 0.15
essay_length school_chartert
0.15 0.09
teacher_prefixMs. resource_typeTechnology
0.08 0.07
eligible_double_your_impact_matcht eligible_almost_home_matcht
0.07 0.07
teacher_prefixMrs. school_stateDC
0.04 0.02
school_stateIL resource_typeBooks
0.01 0.01
我也试过glm
,但它说
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
factor teacher_prefix has new levels
但我在teacher_prefix
变量中没有看到任何新级别:
> levels(training$teacher_prefix)
[1] "" "Dr." "Mr." "Mr. & Mrs." "Mrs." "Ms."
> levels(test$teacher_prefix)
[1] "" "Dr." "Mr." "Mr. & Mrs." "Mrs." "Ms."
答案 0 :(得分:4)
实际上,glmboost
和glm
的问题是相关的。您的teacher_prefix
变量存在问题。
正如glm
示例所指出的那样,test
中的某些级别不在training
(种类)中。虽然两个因素都具有相同的levels()
,但是训练集没有观察teacher_prefix==""
,但测试确实如此。比较
table(test$teacher_prefix)
table(training$teacher_prefix)
因此glm
实际上提供了更准确,更有用的错误消息。问题与glmboost
的问题相同,尽管它并不直接说出来。
这样做似乎&#34;修复&#34;它
test2 <- subset(test, teacher_prefix %in% c("Dr.","Mr.","Mrs.","Ms."))
test2$teacher_prefix <- droplevels(test2$teacher_prefix)
pred <- predict(model, newdata=test2, type="response")
我们只是摆脱未使用的水平,然后进行标准预测。