I have a training set in which certain x columns indicate the particular stadium a match is being played in. These columns are, by construction, linearly dependent in the training set, because every match must take place in exactly one stadium.
The problem I run into is that the test data may include stadiums not seen in the training data. I would therefore like to include all stadium columns when fitting the R glm, constrained so that the stadium coefficients average to zero. Then, if a new stadium is seen, the model would effectively use the average of all known stadium coefficients.
The issue I hit is that R's glm function detects the linearly dependent columns in my training set and sets one of the coefficients to NA so that the remaining ones are linearly independent. How can I:
stop R from inserting an NA coefficient for one of my columns in glm, and ensure that all stadium coefficients sum to 0?
Some example code:
# Past observations
outcome = c(1 ,0 ,0 ,1 ,0 ,1 ,0 ,0 ,1 ,0 ,1 )
skill = c(0.1,0.5,0.6,0.3,0.1,0.3,0.9,0.6,0.5,0.1,0.4)
stadium_1 = c(1 ,1 ,0 ,0 ,0 ,0 ,0 ,0 ,0 ,0 ,0 )
stadium_2 = c(0 ,0 ,1 ,1 ,1 ,1 ,1 ,0 ,0 ,0 ,0 )
stadium_3 = c(0 ,0 ,0 ,0 ,0 ,0 ,0 ,1 ,1 ,1 ,1 )
train_glm_data = data.frame(outcome, skill, stadium_1, stadium_2, stadium_3)
LR = glm(outcome ~ . - outcome, data = train_glm_data, family=binomial(link='logit'))
print(predict(LR, type = 'response'))
# New observations (for a new stadium we have not seen before)
skill = c(0.1)
stadium_1 = c(0 )
stadium_2 = c(0 )
stadium_3 = c(0 )
test_glm_data = data.frame(skill, stadium_1, stadium_2, stadium_3)
print(predict(LR, test_glm_data, type = 'response'))
# Note that in this case the prediction is the same as if we had observed stadium_3,
# because that is the level whose coefficient glm set to NA
# Instead I would like it to be an average of all the known stadium coefficients
# If they all sum to 0 this is essentially already done for me
# However if not, then the stadium_3 effect is buried somewhere in the intercept term
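The NA described above can be reproduced directly. This is a minimal sketch using the question's own data: with an intercept plus all three stadium dummies the design matrix is rank-deficient (the dummies sum to the intercept column), so `glm()` aliases one column and reports NA for its coefficient. The names `d` and `X` are just local to this sketch.

```r
# Reproducing the symptom: intercept + all three dummies is rank-deficient
outcome   <- c(1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1)
skill     <- c(0.1, 0.5, 0.6, 0.3, 0.1, 0.3, 0.9, 0.6, 0.5, 0.1, 0.4)
stadium_1 <- c(1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
stadium_2 <- c(0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0)
stadium_3 <- c(0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1)
d <- data.frame(outcome, skill, stadium_1, stadium_2, stadium_3)

# The model matrix has 5 columns but rank 4, because
# stadium_1 + stadium_2 + stadium_3 equals the intercept column
X <- model.matrix(outcome ~ ., data = d)
qr(X)$rank      # 4, although ncol(X) is 5

# glm() drops the last dependent column (stadium_3) and reports NA for it
LR <- glm(outcome ~ ., data = d, family = binomial(link = "logit"))
is.na(coef(LR)["stadium_3"])    # TRUE
```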
Answer 0 (score: 1)
Regarding the question of how to include coefficients for all levels: don't do this. It is known as the dummy variable trap. If you do not exclude a reference level, the design matrix becomes singular. The only exception is when you estimate a no-intercept model. You can read more about the dummy variable trap here.
Answer 1 (score: 1)
To estimate coefficients for all dummy variables, you can add "-1" to the formula, which removes the intercept:
LR = glm(outcome ~ . - outcome - 1, data = train_glm_data, family=binomial(link='logit'))
The resulting coefficients:
coef(LR)
#      skill  stadium_1  stadium_2  stadium_3
# -2.8080177  0.8424053  0.7541226  1.1313135
For the problem of levels not seen during training, @hack-r made some good suggestions. Another idea is to impute 1/n for all of the dummy variables of the new observation (where n is the number of observed stadiums).
A further option is to collapse the dummy columns into a single factor, add an extra "Stadium 4" level for unseen stadiums, and append one pseudo-observation with average values for that level:
# Collapse the three dummy columns into a single factor column
train_glm_data$stadium <- NA
train_glm_data$stadium[train_glm_data$stadium_1 == 1] <- "Stadium 1"
train_glm_data$stadium[train_glm_data$stadium_2 == 1] <- "Stadium 2"
train_glm_data$stadium[train_glm_data$stadium_3 == 1] <- "Stadium 3"
train_glm_data$stadium_1 <- NULL
train_glm_data$stadium_2 <- NULL
train_glm_data$stadium_3 <- NULL
train_glm_data$stadium <- as.factor(train_glm_data$stadium)
# Add an extra level for stadiums we have not seen yet
levels(train_glm_data$stadium) <- c("Stadium 1", "Stadium 2", "Stadium 3", "Stadium 4")
# Append one averaged pseudo-observation for the unseen level
# (rbind with a character vector coerces the numeric columns to character,
# hence the as.numeric() conversions below)
train_glm_data <- rbind(train_glm_data, c(
  round(mean(outcome)), mean(skill),
  "Stadium 4"
))
train_glm_data$outcome <- as.numeric(train_glm_data$outcome)
train_glm_data$skill <- as.numeric(train_glm_data$skill)
LR = glm(outcome ~ stadium + skill, data = train_glm_data, family=binomial(link='logit'))
print(predict(LR, type = 'response'))
# New observations (for a new stadium we have not seen before)
skill = c(0.1)
stadium = "Stadium 4"
test_glm_data = data.frame(skill, stadium)
print(predict(LR, test_glm_data, type = 'response'))
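The 1/n imputation idea can be sketched against the question's data, combined with the no-intercept ("-1") fit so that every stadium gets its own coefficient: for an unseen stadium, set each stadium dummy to 1/3, so the linear predictor uses the average of the three stadium effects. The names `d` and `new_obs` are illustrative.

```r
outcome   <- c(1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1)
skill     <- c(0.1, 0.5, 0.6, 0.3, 0.1, 0.3, 0.9, 0.6, 0.5, 0.1, 0.4)
stadium_1 <- c(1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
stadium_2 <- c(0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0)
stadium_3 <- c(0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1)
d <- data.frame(outcome, skill, stadium_1, stadium_2, stadium_3)

# No-intercept fit: one coefficient per stadium, none aliased
LR <- glm(outcome ~ . - 1, data = d, family = binomial(link = "logit"))

# New observation in an unseen stadium: impute 1/n (= 1/3) for every
# stadium dummy, so the prediction uses the mean stadium effect
new_obs <- data.frame(skill = 0.1,
                      stadium_1 = 1/3, stadium_2 = 1/3, stadium_3 = 1/3)
predict(LR, new_obs, type = "response")
```

Sum-to-zero (deviation) coding via `contrasts = list(stadium = contr.sum)` on a factor column is another standard way to get stadium effects that average to zero, which is what the question originally asked for.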