h2o.glm coefficients match R glm, but predictions do not

Time: 2016-12-08 22:25:18

Tags: r glm h2o

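I am observing strange behavior when using the interactions argument of the h2o.glm function. Specifically, although the coefficients match those from base R's glm function, the predictions do not. Given nearly identical coefficients, I would expect nearly identical predictions. To demonstrate this, I fit two versions of the glm in R and two versions in h2o. Why do the predictions from the h2o.glm model with interactions fail to match the other glm predictions, despite the nearly identical coefficients?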

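Below is code that reproduces this behavior, with comments noting where the predictions fail to match even though the coefficients do.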

# Load libraries and ingest data.
library(h2o)
h2o.init()
infile <- "https://www.dropbox.com/s/itx2za2p63iez29/h2o_data2.csv?dl=1"
indf <- read.csv(infile, stringsAsFactors = FALSE)
indf$dow_x_hour <- paste(indf$dow, indf$hour)
indf[] <- lapply(indf[], as.factor)
str(indf)
# RESULT OF str(indf)
# 'data.frame': 8100 obs. of  4 variables:
# $ y         : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
# $ dow       : Factor w/ 3 levels "Fri","Sat","Sun": 1 1 1 2 2 2 3 3 3 1 ...
# $ hour      : Factor w/ 3 levels "6","7","8": 1 2 3 1 2 3 1 2 3 1 ...
# $ dow_x_hour: Factor w/ 9 levels "Fri 6","Fri 7",..: 1 2 3 4 5 6 7 8 9 1 ...
hf <- as.h2o(indf)


## FIRST TRY R ----------------
# Fit glm with R using interactions.
r_glm1 <- glm(y ~ dow + hour + dow:hour,
              family = "binomial",
              data = indf)

# Fit glm with R using concatenated column.
r_glm2 <- glm(y ~ dow_x_hour,
              family = "binomial",
              data = indf)

# These two R models generate near-identical predictions.
# RESULT: 4.496403e-15
max(abs(predict(r_glm2, type = "response") - predict(r_glm1, type = "response")))


## NOW H2O ----------------
# Fit glm with h2o using interactions.
h2o_glm1 <- h2o.glm(2:3,
                    1,
                    hf,
                    solver = "IRLSM",
                    family = "binomial",
                    interactions = 2:3,
                    lambda_search = FALSE,
                    lambda = 0,
                    compute_p_values = TRUE)

# Fit glm with h2o using concatenated column.
h2o_glm2 <- h2o.glm(4,
                    1,
                    hf,
                    solver = "IRLSM",
                    family = "binomial",
                    lambda_search = FALSE,
                    lambda = 0,
                    compute_p_values = TRUE)

# These two H2O models do not generate the same predictions.
# RESULT: 0.06211734
max(abs(h2o.predict(h2o_glm1, hf)$p1 - h2o.predict(h2o_glm2, hf)$p1))


## COMPARE R VS H2O PREDICTIONS ----------------

# The R and h2o models using concatenated column produce near-identical predictions.
# RESULT: 3.356773e-07
max(abs(predict(r_glm2, type = "response") - as.data.frame(h2o.predict(h2o_glm2, hf))$p1))

# The R and h2o models using interactions DO NOT produce near-identical predictions.
# RESULT: 0.06211732
max(abs(predict(r_glm1, type = "response") - as.data.frame(h2o.predict(h2o_glm1, hf))$p1))


## COMPARE R VS H2O COEFFICIENTS ----------------

# The R and h2o models using interactions produce near-identical coefficients
# (we manually matched them up here).
# RESULT: 3.341192e-06
df_coef <- cbind(h2o_glm1@model$coefficients_table, r_coef = coef(r_glm1)[c(1,6,8,7,9,2:5)])
max(abs(df_coef$coefficients - df_coef$r_coef))

1 Answer:

Answer 0 (score: 2)

I think the problem lies in h2o.predict, which does not correctly handle interactions between two factors.

Finding the problem

Here I can show you that h2o.predict mislabels your interaction terms.

pred1 <- as.data.frame(h2o.predict(h2o_glm1, hf))$p1
diff <- unname(predict(r_glm1, type = "link")) - log(pred1/(1-pred1))

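pred1 is the response prediction from h2o_glm1, and diff is the difference between the link predictions of r_glm1 and h2o_glm1. A link prediction is just the linear combination of the input data and the coefficients.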
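As a minimal base R sketch of why log(pred1/(1-pred1)) is comparable to predict(..., type = "link"): for a binomial glm, the link prediction equals the design matrix times the coefficients, and the logit of the response prediction recovers it. The toy data frame and all object names below are made up for illustration and are not the dataset from the question.

```r
# Toy data (hypothetical, not the question's dataset): binary response, two factors.
set.seed(1)
toy <- expand.grid(dow = c("Fri", "Sat", "Sun"),
                   hour = c("6", "7", "8"),
                   rep = 1:50)
toy$y <- rbinom(nrow(toy), 1, 0.4)

fit <- glm(y ~ dow + hour + dow:hour, family = "binomial", data = toy)

link_pred <- predict(fit, type = "link")
resp_pred <- predict(fit, type = "response")

# The link prediction is the design matrix times the coefficient vector.
manual_link <- drop(model.matrix(fit) %*% coef(fit))

# The logit of the response prediction recovers the link prediction.
logit_resp <- log(resp_pred / (1 - resp_pred))

# Both differences should be at the level of floating-point noise.
max(abs(link_pred - manual_link))
max(abs(link_pred - logit_resp))
```

Working on the link scale is convenient here because a mislabeled coefficient shows up as an additive offset rather than being distorted by the logistic transform.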

After that, we can tabulate the differences by dow and hour:
tapply(diff, list(indf$dow, indf$hour), mean)
#               6           7           8
# Fri -0.01645868 -0.01580134 -0.01580118
# Sat -0.01580673  0.01580109 -0.14319379
# Sun -0.01580207 -0.53848173  0.68233048
tapply(diff, list(indf$dow, indf$hour), sd)
# all 0

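The standard deviations are all 0, which means the prediction difference is constant within each combination of levels. This indicates that the error comes from how dow and hour are grouped.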

We can go further and see how h2o.predict labels the interaction terms. Here I build the correct coefficient matrix for dow and hour:

coef <- h2o_glm1@model$coefficients
coef_M <- matrix(c(0,0,0,
                   0,coef[2],coef[3],
                   0,coef[4],coef[5]),3,byrow = TRUE)

#      [,1]        [,2]        [,3]
# [1,]    0  0.00000000  0.00000000
# [2,]    0  0.01580116 -0.12739263
# [3,]    0 -0.66587428  0.01645634

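Since diff is the R link prediction minus the h2o link prediction, and the intercept and main effects nearly agree, subtracting diff from this matrix gives the interaction coefficients that h2o.predict actually uses: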

- tapply(diff, list(indf$dow, indf$hour), mean) + coef_M
#              6           7             8
# Fri 0.01645868  0.01580134    0.01580118
# Sat 0.01580673  6.854821e-08  0.01580116
# Sun 0.01580207 -0.1273926    -0.66587414

coef[2:5]
# dow_hour.Sat_7 dow_hour.Sat_8 dow_hour.Sun_7 dow_hour.Sun_8 
#     0.01580116    -0.12739263    -0.66587428     0.01645634 

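Here I also list the interaction coefficients of h2o_glm1. Every value in the last table matches one of the interaction coefficients (or zero), but not the correct one. So the mapping of the two-factor interaction in h2o.predict is incorrect: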
#          6       7        8
# Fri  Sun_8   Sat_7    Sat_7
# Sat  Sat_7       0    Sat_7
# Sun  Sat_7   Sat_8    Sun_7
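
The label table above can be reproduced mechanically by assigning each recovered value to its nearest candidate coefficient. The sketch below hard-codes the numbers printed earlier in this answer, so it runs without h2o; the names used, candidates, nearest_label and label_tab are made up for illustration. Note that Sat_7 and Sun_8 are close in value (about 0.0158 versus 0.0165), so the matching relies on the nearest value rather than an exact match.

```r
# Coefficients apparently used by h2o.predict, copied from the output above.
used <- matrix(c(0.01645868,  0.01580134,    0.01580118,
                 0.01580673,  6.854821e-08,  0.01580116,
                 0.01580207, -0.1273926,    -0.66587414),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("Fri", "Sat", "Sun"), c("6", "7", "8")))

# Candidate values: zero plus the four fitted interaction coefficients.
candidates <- c("0"   = 0,
                Sat_7 = 0.01580116,
                Sat_8 = -0.12739263,
                Sun_7 = -0.66587428,
                Sun_8 = 0.01645634)

# Return the name of the candidate closest to a single value.
nearest_label <- function(v) names(candidates)[which.min(abs(candidates - v))]

# Apply cell by cell; the result keeps the dow x hour layout.
label_tab <- apply(used, c(1, 2), nearest_label)
label_tab
```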

Changing to numeric

However, h2o.predict does handle interactions between a factor and a numeric variable, or between two numeric variables.

If we change hour from a factor to numeric with indf$hour = as.numeric(as.character(indf$hour)) and repeat the same modeling procedure, the difference between base R glm and h2o.glm is small:

indf2 <- indf
indf2$hour <- as.numeric(as.character(indf$hour))
r_glm3 <- glm(y ~ dow + hour + dow:hour,
              family = "binomial",
              data = indf2)
hf2 <- as.h2o(indf2)
h2o_glm3 <- h2o.glm(2:3,
                    1,
                    hf2,
                    solver = "IRLSM",
                    family = "binomial",
                    interactions = 2:3,
                    lambda_search = FALSE,
                    lambda = 0,
                    compute_p_values = TRUE)
max(abs(predict(r_glm3, type = "response") - as.data.frame(h2o.predict(h2o_glm3, hf2))$p1))
# 9.866078e-08

Creating the interaction term

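I think the best workaround, as you described in your question, is to create a new interaction variable. Here is another alternative that uses h2o.interaction to generate this term: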

hf3 <- hf
hf3$dow_hour <- h2o.interaction(hf, factors = 2:3, pairwise = TRUE, max_factors = 100, min_occurrence = 1)

h2o_glm4 <- h2o.glm(5,
                    1,
                    hf3,
                    solver = "IRLSM",
                    family = "binomial",
                    lambda_search = FALSE,
                    lambda = 0,
                    compute_p_values = TRUE)
max(abs(predict(r_glm1, type = "response") - as.data.frame(h2o.predict(h2o_glm4, hf3))$p1))
# 3.356773e-07