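I am seeing strange behavior when using the interactions parameter of the h2o.glm function. Specifically, although the coefficients match those from base R's glm function, the predictions do not. Given near-identical coefficients, I would expect near-identical predictions. Below, I run two versions of glm in R and two versions in h2o to demonstrate this behavior. Why do the predictions from the h2o.glm model with interactions fail to match the other glm predictions, despite near-identical coefficients?
Here is code that reproduces the behavior, with comments marking where the predictions do not match even though the coefficients do.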
# Load libraries and ingest data.
library(h2o)
h2o.init()
infile <- "https://www.dropbox.com/s/itx2za2p63iez29/h2o_data2.csv?dl=1"
indf <- read.csv(infile, stringsAsFactors = FALSE)
indf$dow_x_hour <- paste(indf$dow, indf$hour)
indf[] <- lapply(indf[], as.factor)
str(indf)
# RESULT OF str(indf)
# 'data.frame': 8100 obs. of 4 variables:
# $ y : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
# $ dow : Factor w/ 3 levels "Fri","Sat","Sun": 1 1 1 2 2 2 3 3 3 1 ...
# $ hour : Factor w/ 3 levels "6","7","8": 1 2 3 1 2 3 1 2 3 1 ...
# $ dow_x_hour: Factor w/ 9 levels "Fri 6","Fri 7",..: 1 2 3 4 5 6 7 8 9 1 ...
hf <- as.h2o(indf)
## FIRST TRY R ----------------
# Fit glm with R using interactions.
r_glm1 <- glm(y ~ dow + hour + dow:hour,
family = "binomial",
data = indf)
# Fit glm with R using concatenated column.
r_glm2 <- glm(y ~ dow_x_hour,
family = "binomial",
data = indf)
# These two R models generate near-identical predictions.
# RESULT: 4.496403e-15
max(abs(predict(r_glm2, type = "response") - predict(r_glm1, type = "response")))
## NOW H2O ----------------
# Fit glm with h2o using interactions.
h2o_glm1 <- h2o.glm(2:3,
1,
hf,
solver = "IRLSM",
family = "binomial",
interactions = 2:3,
lambda_search = FALSE,
lambda = 0,
compute_p_values = TRUE)
# Fit glm with h2o using concatenated column.
h2o_glm2 <- h2o.glm(4,
1,
hf,
solver = "IRLSM",
family = "binomial",
lambda_search = FALSE,
lambda = 0,
compute_p_values = TRUE)
# These two H2O models do not generate the same predictions.
# RESULT: 0.06211734
max(abs(h2o.predict(h2o_glm1, hf)$p1 - h2o.predict(h2o_glm2, hf)$p1))
## COMPARE R VS H2O PREDICTIONS ----------------
# The R and h2o models using concatenated column produce near-identical predictions.
# RESULT: 3.356773e-07
max(abs(predict(r_glm2, type = "response") - as.data.frame(h2o.predict(h2o_glm2, hf))$p1))
# The R and h2o models using interactions DO NOT produce near-identical predictions.
# RESULT: 0.06211732
max(abs(predict(r_glm1, type = "response") - as.data.frame(h2o.predict(h2o_glm1, hf))$p1))
## COMPARE R VS H2O COEFFICIENTS ----------------
# The R and h2o models using interactions produce near-identical coefficients
# (we manually matched them up here).
# RESULT: 3.341192e-06
df_coef <- cbind(h2o_glm1@model$coefficients_table, r_coef = coef(r_glm1)[c(1,6,8,7,9,2:5)])
max(abs(df_coef$coefficients - df_coef$r_coef))
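Answer 0 (score: 2)
I think the problem lies in h2o.predict, which does not handle interactions between two factors correctly.
Below, I show that h2o.predict mislabels your interaction terms.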
pred1 <- as.data.frame(h2o.predict(h2o_glm1, hf))$p1
diff <- unname(predict(r_glm1, type = "link")) - log(pred1/(1-pred1))
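pred1 is the response prediction from h2o_glm1, and diff is the difference between the link predictions of r_glm1 and h2o_glm1. A link prediction is just the linear combination of the input data with the coefficients.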
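To make the link-scale comparison concrete, here is a minimal base R sketch on synthetic data (the toy data frame and every name in it are made up for illustration and are not the dataset from the question). It checks that the link prediction equals the design matrix times the coefficients, and that log(p/(1-p)) applied to the response prediction recovers the same link value, which is the transformation applied to pred1 above.

```r
# Synthetic data with the same shape as the question's factors (illustrative only).
set.seed(1)
toy <- data.frame(
  dow  = factor(sample(c("Fri", "Sat", "Sun"), 300, replace = TRUE)),
  hour = factor(sample(c("6", "7", "8"), 300, replace = TRUE))
)
toy$y <- rbinom(300, 1, 0.4)

fit <- glm(y ~ dow + hour + dow:hour, family = "binomial", data = toy)

# Link prediction computed by hand: design matrix times coefficients.
X <- model.matrix(fit)
link_manual <- drop(X %*% coef(fit))

# Link prediction from predict(), and the logit of the response prediction.
link_pred   <- unname(predict(fit, type = "link"))
p           <- unname(predict(fit, type = "response"))
link_from_p <- log(p / (1 - p))

gap_manual <- max(abs(link_manual - link_pred))
gap_logit  <- max(abs(link_from_p - link_pred))
print(c(gap_manual = gap_manual, gap_logit = gap_logit))
```

Both gaps should be at floating-point noise level, so any sizeable difference on the link scale between two models with matching coefficients has to come from how the inputs are mapped to coefficients.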
Next, we can summarize diff by dow and hour:
tapply(diff, list(indf$dow, indf$hour), mean)
# 6 7 8
# Fri -0.01645868 -0.01580134 -0.01580118
# Sat -0.01580673 0.01580109 -0.14319379
# Sun -0.01580207 -0.53848173 0.68233048
tapply(diff, list(indf$dow, indf$hour), sd)
# all 0
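A standard deviation of 0 in every cell means the prediction difference is constant within each dow/hour combination. This shows that the error comes from how the dow and hour groups are handled.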
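The diagnostic pattern can be imitated in base R alone. The sketch below is a simulation under stated assumptions, not a reproduction of H2O's internals: it fits a model on synthetic data, deliberately scrambles the interaction coefficients (beta_bad is a hypothetical stand-in for a predictor that mislabels its terms), and then checks that the link-scale difference is constant within each cell, so the per-cell standard deviation is 0 while some per-cell means are not.

```r
# Balanced synthetic data: 50 rows for each dow/hour cell (illustrative only).
set.seed(2)
toy <- expand.grid(dow = c("Fri", "Sat", "Sun"),
                   hour = c("6", "7", "8"),
                   rep = 1:50)
toy$y <- rbinom(nrow(toy), 1, 0.3)

fit  <- glm(y ~ dow * hour, family = binomial, data = toy)
beta <- coef(fit)
X    <- model.matrix(fit)

# Hypothetical "mislabelled" predictor: same coefficient values,
# but the interaction terms are assigned to the wrong cells.
is_int   <- grepl(":", names(beta))
beta_bad <- beta
beta_bad[is_int] <- rev(beta[is_int])

# Link-scale difference between the correct and the scrambled predictor.
d <- drop(X %*% beta) - drop(X %*% beta_bad)

cell_mean <- tapply(d, list(toy$dow, toy$hour), mean)
cell_sd   <- tapply(d, list(toy$dow, toy$hour), sd)
print(round(cell_mean, 4))
print(cell_sd)
```

In this simulation the baseline row and column (Fri and 6) show no difference because no interaction dummy is active there, which differs from the table above; only the zero-spread, nonzero-mean signature carries over.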
We can go a step further and work out how h2o.predict labels the interaction terms.
First, I build the correct coefficient matrix for dow and hour:
coef <- h2o_glm1@model$coefficients
coef_M <- matrix(c(0,0,0,
0,coef[2],coef[3],
0,coef[4],coef[5]),3,byrow = TRUE)
# [,1] [,2] [,3]
# [1,] 0 0.00000000 0.00000000
# [2,] 0 0.01580116 -0.12739263
# [3,] 0 -0.66587428 0.01645634
Then, subtracting the differences from this matrix, we recover the coefficients that h2o.predict actually used:
- tapply(diff, list(indf$dow, indf$hour), mean) + coef_M
# 6 7 8
# Fri 0.01645868 0.01580134 0.01580118
# Sat 0.01580673 6.854821e-08 0.01580116
# Sun 0.01580207 -0.1273926 -0.66587414
coef[2:5]
# dow_hour.Sat_7 dow_hour.Sat_8 dow_hour.Sun_7 dow_hour.Sun_8
# 0.01580116 -0.12739263 -0.66587428 0.01645634
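Here I also list the interaction coefficients of h2o_glm1. Every value in the last table matches one of the interaction coefficients, just not the correct one. In effect, h2o.predict applies the interaction terms as follows: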
# 6 7 8
# Fri Sun_8 Sat_7 Sat_7
# Sat Sat_7 0 Sat_7
# Sun Sat_7 Sat_8 Sun_7
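However, h2o.predict does handle interactions between a factor and a numeric variable, or between two numeric variables.
If we change hour from a factor to a numeric variable with indf$hour = as.numeric(as.character(indf$hour)) and repeat the same modeling procedure, the difference between base R glm and h2o.glm is tiny: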
indf2 <- indf
indf2$hour <- as.numeric(as.character(indf$hour))
r_glm3 <- glm(y ~ dow + hour + dow:hour,
family = "binomial",
data = indf2)
hf2 <- as.h2o(indf2)
h2o_glm3 <- h2o.glm(2:3,
1,
hf2,
solver = "IRLSM",
family = "binomial",
interactions = 2:3,
lambda_search = FALSE,
lambda = 0,
compute_p_values = TRUE)
max(abs(predict(r_glm3, type = "response") - as.data.frame(h2o.predict(h2o_glm3, hf2))$p1))
# 9.866078e-08
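As you noted in your question, I think the best workaround is to create a new interaction variable yourself. Here is another workaround, which uses h2o.interaction to generate the term: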
hf3 <- hf
hf3$dow_hour <- h2o.interaction(hf,factors = 2:3, pairwise = TRUE, max_factors = 100, min_occurrence = 1)
h2o_glm4 <- h2o.glm(5,
1,
hf3,
solver = "IRLSM",
family = "binomial",
lambda_search = FALSE,
lambda = 0,
compute_p_values = TRUE)
max(abs(predict(r_glm1, type = "response") - as.data.frame(h2o.predict(h2o_glm4, hf3))$p1))
# 3.356773e-07
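For anyone who wants to sanity-check the combined-column idea without an H2O cluster, here is a minimal base R sketch on synthetic data (the toy data frame and its column names are illustrative, not the question's dataset). It uses base R's interaction() to build the combined factor and confirms that a model on that single column gives the same fitted probabilities as the model written with explicit interaction terms.

```r
# Synthetic data (illustrative only).
set.seed(3)
toy <- data.frame(
  dow  = factor(sample(c("Fri", "Sat", "Sun"), 600, replace = TRUE)),
  hour = factor(sample(c("6", "7", "8"), 600, replace = TRUE))
)
toy$y <- rbinom(600, 1, 0.35)

# One factor level per observed dow/hour combination.
toy$dow_hour <- interaction(toy$dow, toy$hour, sep = "_", drop = TRUE)
n_levels <- nlevels(toy$dow_hour)

# Same model expressed two ways: explicit interaction terms vs. combined factor.
fit_terms    <- glm(y ~ dow + hour + dow:hour, family = binomial, data = toy)
fit_combined <- glm(y ~ dow_hour, family = binomial, data = toy)

gap <- max(abs(fitted(fit_terms) - fitted(fit_combined)))
print(c(n_levels = n_levels, gap = gap))
```

Both parameterizations span the same set of cell-level probabilities, which is why a precomputed interaction column is a safe substitute when prediction over factor-by-factor terms is in doubt.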