Question

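I have a dataset in a data frame daf, which I split into training and test data by date: I train on dates below 20090000 and test on dates above it. To do this, the original data frame is split into daf_train and daf_test.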

I am using a GLM, and the model includes a factor, daf$city. The problem is that daf_test sometimes contains new cities that were not seen in daf_train.

I think the best way to handle this is to do something like

levels(daf_train$city) = levels(daf$city)

to tell the model in advance about all the possible cities.

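I would like the GLM to recognize that, for a city it has not seen before, it should use the mean of the city factor coefficients. I believe this would be enough if the coefficients of all previously seen cities had a mean of zero.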
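The zero-mean idea can be spelled out as follows. This is only a sketch, in notation that is not from the original post: $\gamma_k$ is assumed to denote the coefficient of city $k$ among the $K$ cities seen in training, and $\bar\gamma$ their mean.

```latex
\begin{align*}
\bar\gamma &= \frac{1}{K}\sum_{j=1}^{K}\gamma_j \\
\text{seen city } k:\quad \operatorname{logit}(p)
  &= \beta_0 + \beta_1 X + \gamma_k
   = (\beta_0 + \bar\gamma) + \beta_1 X + (\gamma_k - \bar\gamma) \\
\text{unseen city}:\quad \operatorname{logit}(p)
  &= (\beta_0 + \bar\gamma) + \beta_1 X + 0
\end{align*}
```

After this shift the recentred city coefficients $\gamma_k - \bar\gamma$ sum to zero, so giving an unseen city a coefficient of zero is the same as giving it the average city effect, while predictions for seen cities are unchanged.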

How can I change this code to do that?

mylogit = glm(Y ~ X + factor(city), data=daf_train, family=binomial(link='logit'))
predictions = predict(mylogit, daf_test, type='response')

Note: what follows is a very ugly and non-general approach (I am also new to R, so this may well mess up the GLM object).

cityLevels = levels(factor(daf$city))
daf_train$city = factor(daf_train$city, cityLevels)

# daf_train$city now has all the levels of the overall dataset 
# But if we train a GLM now, it will ignore any levels without observations

# Instead we split the factor into binary variables
train_data = cbind(daf_train, model.matrix( ~ 0 + city, daf_train))
# Remove the factor variable
train_data$city = NULL

# Now train the GLM
mylogit = glm(Y ~., data = train_data, family=binomial(link='logit'))

# This gives us coefficient values for all factors in the training set
# Any factors not in the training set get coefficient values of NA

# Finally we must convert the factor coefficients to have zero mean
# na.rm = TRUE because cities without a fitted coefficient are NA (see above)
offset = mean(mylogit$coefficients[-1:-34], na.rm = TRUE)
mylogit$coefficients[-1:-34] = mylogit$coefficients[-1:-34] - offset
mylogit$coefficients[1] = mylogit$coefficients[1] + offset

# Yeuch, this required us to know where in our coefficients vector our cities started (34)

Answer 1

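I think this is what you are looking for. It is still ugly, but I think it is more general than your code. Let me know if it misses the mark and I can revise or delete it.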

# dummy data
set.seed(321)
daf_train <- data.frame(x = runif(100, min=10, max=50), 
                        y = runif(100), 
                        city = sample(c("city1", "city2", "city3"), size=100, replace=TRUE))

set.seed(321)
daf_test <- data.frame(x = runif(30, min=10, max=50),
                       y = runif(30),
                       city = sample(c("city1", "city2", "city3", "city4"), size=30, replace=TRUE))

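# since R 4.0, data.frame() keeps strings as character, so make city a factor explicitly
daf_test$city <- factor(daf_test$city)
daf_train$city <- factor(daf_train$city, levels=levels(daf_test$city))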


# cities in test set but not train set
(newcity <- sort(unique(daf_test$city))[!sort(unique(daf_test$city)) %in% unique(daf_train$city)])
[1] city4
Levels: city1 city2 city3 city4

# fit model with city1, city2, city3
xreg <- cbind(x=daf_train$x, model.matrix(~ 0 + city, data=daf_train))

mylogit = glm(y ~ xreg, data=daf_train, family=binomial(link='logit'))

newxreg <- cbind(x=daf_test$x, model.matrix(~ 0 + city, data=daf_test))

# mean of city coefficients
if (length(newcity) > 0) {

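  # coefficients from model
  citycoef <- coef(mylogit)[grepl("city", names(coef(mylogit)))]

  # cities seen in training; an NA coefficient among them is the aliased
  # baseline city (collinear with the intercept), whose effect is 0
  seen <- !sub("^xreg", "", names(citycoef)) %in% paste0("city", newcity)
  citycoef[seen & is.na(citycoef)] <- 0

  # calculate coefficient for new city(cities)
  citycoef_offset <- mean(citycoef[seen])

  # repeat for all new cities
  citycoef[!seen] <- citycoef_offset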

  # center coefficients
  citycoef <- scale(citycoef, center=TRUE, scale=FALSE)[, 1]

  # replace city coefficients from model
  modelcoef <- coef(mylogit)

  # add offset to intercept
  modelcoef[["(Intercept)"]] <- modelcoef[["(Intercept)"]] + citycoef_offset

  # all new coefficients
  modelcoef[match(names(citycoef), names(modelcoef))] <- citycoef

  # Beta0 + Beta1x...
  pcoef <- modelcoef[["(Intercept)"]] + 
    newxreg %*%
    modelcoef[!names(modelcoef) == "(Intercept)"]

  #predicted response
  predictions <- unlist(lapply(pcoef, function(x) exp(x) / (1 + exp(x))))


} else {
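  # predict() cannot rebuild the xreg matrix from daf_test (it would silently
  # reuse the training xreg), so compute the linear predictor directly
  beta <- coef(mylogit)
  beta[is.na(beta)] <- 0  # aliased columns contribute nothing
  predictions <- plogis(drop(cbind(1, newxreg) %*% beta))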
}
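The same idea can also be sketched without building the dummy matrix by hand. This is a minimal illustration, not the answer's original method: the names `train`, `test`, `fit`, `city_eff` and `mean_eff` are hypothetical, and `y` is drawn with `rbinom` purely so that the binomial fit gets 0/1 data. It assumes R's default treatment contrasts, under which the first city level is the baseline with an effect of 0.

```r
set.seed(321)
# Toy data (assumed for illustration): city4 appears only in the test set
train <- data.frame(x = runif(100, min = 10, max = 50),
                    y = rbinom(100, size = 1, prob = 0.5),
                    city = sample(c("city1", "city2", "city3"), 100, replace = TRUE))
test <- data.frame(x = runif(30, min = 10, max = 50),
                   city = sample(c("city1", "city2", "city3", "city4"), 30, replace = TRUE))

# Fit using only the cities seen in training
train$city <- factor(train$city)
fit <- glm(y ~ x + city, data = train, family = binomial(link = "logit"))
cf <- coef(fit)

# Effect of every training city; the baseline level has effect 0
lv <- levels(train$city)
city_eff <- setNames(numeric(length(lv)), lv)
city_eff[-1] <- cf[paste0("city", lv[-1])]
mean_eff <- mean(city_eff)

# Look up each test row's city effect; unseen cities get the mean effect
eff <- unname(city_eff[as.character(test$city)])
eff[is.na(eff)] <- mean_eff

# Linear predictor and inverse logit
eta <- cf[["(Intercept)"]] + cf[["x"]] * test$x + eff
predictions <- plogis(eta)
```

For rows whose city was seen in training, this should reproduce what `predict(fit, ..., type = "response")` returns; only rows with a new city are treated differently. Counting the baseline's zero in `mean_eff` matters, since leaving it out would bias the average toward the non-baseline cities.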

Handling unknown factor levels in R GLM
