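I have a dataset, daf, which I split by date into training and test data: I train on dates below 20090000 and test on the dates above it. To do this, the original data frame is split into daf_train and daf_test.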
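For readers who want to reproduce the setup, a minimal sketch of such a split might look like the following. The toy data and the column name `date` (an integer in yyyymmdd form) are assumptions made for illustration; the post does not show the actual columns.

```r
# Hypothetical toy data; the `date` column name and its yyyymmdd format are assumptions
daf <- data.frame(
  date = c(20080115, 20081201, 20090310, 20100522),
  X    = c(1.2, 0.7, 3.1, 2.4),
  Y    = c(0, 1, 1, 0),
  city = c("Paris", "Lyon", "Paris", "Nice")
)

cutoff <- 20090000
daf_train <- daf[daf$date <  cutoff, ]   # rows dated before the cutoff
daf_test  <- daf[daf$date >= cutoff, ]   # rows dated on or after the cutoff

# Cities that appear only in the test period
new_cities <- setdiff(daf_test$city, daf_train$city)
new_cities
```

Checking `new_cities` before fitting shows up front whether the test period contains levels the model will never have seen.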
I am using a GLM with a factor, daf$city, in the model. The problem is that daf_test sometimes contains new cities that were not seen in daf_train.
I think the best way to handle this is to do something like levels(daf_train$city) = levels(daf$city), so that the model is forewarned of all possible cities.
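One caveat with assigning levels directly: `levels(x) <- value` relabels the existing levels by position, so it can silently rename cities when the training set holds only a subset of them. The sketch below uses made-up city names to show the difference from re-creating the factor with `factor(x, levels = ...)`, which is the form the code further down uses.

```r
all_cities <- c("Lyon", "Nice", "Paris")           # hypothetical full level set
train_city <- factor(c("Paris", "Lyon", "Paris"))  # levels are Lyon, Paris

# Positional relabelling: the 2nd level ("Paris") is renamed to "Nice"
relabelled <- train_city
levels(relabelled) <- all_cities
as.character(relabelled)

# Re-creating the factor keeps the values and only adds the unused level
extended <- factor(train_city, levels = all_cities)
as.character(extended)
table(extended)
```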
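I would like the GLM to recognise that, for a city it has not seen before, it should use the mean of the city factor coefficients. If the coefficients of all previously seen cities had zero mean, I think this would suffice.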
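To spell out why zero-mean city coefficients would be enough, here is a short derivation. The notation ($\beta_0$, $\beta_1$, $\gamma_c$, $K$) is introduced here for illustration and does not come from the post; it assumes one effect $\gamma_c$ for each of the $K$ cities seen in training (under R's default treatment contrasts the reference city has $\gamma_c = 0$ and still counts towards the mean).

```latex
\begin{aligned}
\operatorname{logit}(p_i) &= \beta_0 + \beta_1 x_i + \gamma_{c(i)}, \qquad c(i) \in \{1,\dots,K\},\\
\bar\gamma &= \frac{1}{K}\sum_{c=1}^{K}\gamma_c, \qquad
\beta_0' = \beta_0 + \bar\gamma, \qquad
\gamma_c' = \gamma_c - \bar\gamma,\\
\beta_0' + \gamma_c' &= \beta_0 + \gamma_c, \qquad
\sum_{c=1}^{K}\gamma_c' = 0,\\
\operatorname{logit}(p_{\text{new}}) &= \beta_0' + \beta_1 x_{\text{new}} + 0
 = \beta_0 + \beta_1 x_{\text{new}} + \bar\gamma .
\end{aligned}
```

Shifting the mean into the intercept leaves every fitted probability for a seen city unchanged, and giving an unseen city a coefficient of zero then amounts to giving it the average city effect.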
How can I change the code below to do this?
mylogit = glm(Y ~ X + factor(city), data=daf_train, family=binomial(link='logit'))
predictions = predict(mylogit, daf_test, type='response')
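For context, this is roughly how the problem shows up in practice: `predict()` stops with an error when the new data contains a factor level that was absent during fitting. The simulated data and city names below are made up for illustration.

```r
set.seed(1)
# Simulated training data with two cities (illustrative only)
train <- data.frame(
  X    = rnorm(40),
  Y    = rbinom(40, 1, 0.5),
  city = sample(c("Lyon", "Paris"), 40, replace = TRUE)
)
# Test data containing a city ("Nice") that the model has never seen
test <- data.frame(X = rnorm(3), city = c("Lyon", "Paris", "Nice"))

fit <- glm(Y ~ X + factor(city), data = train, family = binomial(link = "logit"))

# Capture the error message instead of stopping the script
res <- tryCatch(
  predict(fit, test, type = "response"),
  error = function(e) conditionMessage(e)
)
res
```

Here `res` holds the error message rather than a vector of predictions, which is the failure the rest of this post tries to work around.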
Note: what follows is a very ugly and non-general approach (I am also new to R, so it may well mess up the GLM object).
cityLevels = levels(factor(daf$city))
daf_train$city = factor(daf_train$city, cityLevels)
# daf_train$city now has all the levels of the overall dataset
# But if we train a GLM now, it will ignore any levels without observations
# Instead we split the factor into binary variables
train_data = cbind(daf_train, model.matrix( ~ 0 + city, daf_train))
# Remove the factor variable
train_data$city = NULL
# Now train the GLM
mylogit = glm(Y ~., data = train_data, family=binomial(link='logit'))
# This gives us coefficient values for all factors in the training set
# Any factors not in the training set get coefficient values of NA
# Finally we must convert the factor coefficients to have zero mean
# (na.rm = TRUE because cities without training observations have NA coefficients)
offset = mean(mylogit$coefficients[-1:-34], na.rm = TRUE)
mylogit$coefficients[-1:-34] = mylogit$coefficients[-1:-34] - offset
mylogit$coefficients[1] = mylogit$coefficients[1] + offset
# Yeuch, this required us to know where in our coefficients vector our cities started (34)
Answer 0 (score: 1)
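I think this is what you are looking for. It is still ugly, but I think it is more general than your code. Let me know if it misses the mark and I can revise or delete it.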
# dummy data
set.seed(321)
daf_train <- data.frame(x = runif(100, min=10, max=50),
                        y = runif(100),
                        city = sample(c("city1", "city2", "city3"), size=100, replace=TRUE))
set.seed(321)
daf_test <- data.frame(x = runif(30, min=10, max=50),
                       y = runif(30),
                       city = sample(c("city1", "city2", "city3", "city4"), size=30, replace=TRUE))
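# give both sets the same factor levels (since R 4.0, data.frame() keeps strings
# as character, so levels(daf_test$city) would be NULL without this conversion)
city_levels <- sort(union(as.character(daf_train$city), as.character(daf_test$city)))
daf_train$city <- factor(daf_train$city, levels=city_levels)
daf_test$city <- factor(daf_test$city, levels=city_levels)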
# cities in test set but not train set
(newcity <- sort(unique(daf_test$city))[!sort(unique(daf_test$city)) %in% unique(daf_train$city)])
[1] city4
Levels: city1 city2 city3 city4
# fit model with city1, city2, city3
xreg <- cbind(x=daf_train$x, model.matrix(~ 0 + city, data=daf_train))
mylogit = glm(y ~ xreg, data=daf_train, family=binomial(link='logit'))
newxreg <- cbind(x=daf_test$x, model.matrix(~ 0 + city, data=daf_test))
# coefficients from model; the city terms are named "xregcity<level>"
modelcoef <- coef(mylogit)
is_city <- grepl("city", names(modelcoef))
citycoef <- modelcoef[is_city]
# cities that were observed in the training set
seen <- !names(citycoef) %in% paste0("xregcity", newcity)
# a seen city with an NA coefficient is the aliased reference level, so its effect is 0
citycoef[seen & is.na(citycoef)] <- 0
# mean of city coefficients, used as the coefficient for new city(cities)
citycoef_offset <- mean(citycoef[seen])
citycoef[!seen] <- citycoef_offset
# center coefficients and add offset to intercept
modelcoef[is_city] <- citycoef - citycoef_offset
modelcoef[["(Intercept)"]] <- modelcoef[["(Intercept)"]] + citycoef_offset
# Beta0 + Beta1x...
pcoef <- modelcoef[["(Intercept)"]] +
  newxreg %*% modelcoef[names(modelcoef) != "(Intercept)"]
# predicted response (inverse logit); this also works when there are no new cities,
# whereas predict(mylogit, daf_test) cannot find xreg in daf_test
predictions <- plogis(drop(pcoef))
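As a cross-check, here is a compact alternative sketch that avoids building the dummy matrix by hand. It is not the method above: it fits an ordinary `glm(y ~ x + city)` with R's default treatment contrasts and then computes the linear predictor manually, so that unseen cities fall back to the mean city effect. The simulated data and the names `effect`, `city_effect` and `eta` are assumptions made for illustration.

```r
set.seed(42)
# Simulated data (illustrative only): three cities in training, a fourth in test
train <- data.frame(
  x    = runif(200, min = 10, max = 50),
  city = sample(c("city1", "city2", "city3"), size = 200, replace = TRUE)
)
train$y <- rbinom(200, 1, plogis(-1 + 0.03 * train$x))
test <- data.frame(x = c(20, 30, 40), city = c("city1", "city3", "city4"))

fit <- glm(y ~ x + city, data = train, family = binomial(link = "logit"))

# Per-city effect on the logit scale; the reference level has effect 0
seen   <- fit$xlevels$city
effect <- setNames(numeric(length(seen)), seen)
effect[-1] <- coef(fit)[paste0("city", seen[-1])]

# Look up each test city; unseen cities give NA and get the mean effect instead
city_effect <- effect[as.character(test$city)]
city_effect[is.na(city_effect)] <- mean(effect)

# Linear predictor and predicted probability
eta <- coef(fit)[["(Intercept)"]] + coef(fit)[["x"]] * test$x + city_effect
predictions <- plogis(unname(eta))
predictions
```

For rows whose city was seen in training, these values should agree with `predict(fit, newdata, type = "response")`; only the unseen city is treated differently.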