Reproducing the Airlines Delay h2o Flow example with the h2o package

Date: 2018-03-16 11:14:45

Tags: r h2o

The following script reproduces the problem described in the h2o Flow help (Help -> View Example Flow, or Help -> Browse Installed packs.. -> examples -> Airlines Delay.flow), but using the h2o R package and a fixed seed (123456):

library(h2o)
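# nthreads = -1 uses all available cores
h2o.init(max_mem_size = "12g", nthreads = -1)

# switch(1, ...) selects FALSE (remote file); use 2 to select TRUE (local file)
IS_LOCAL_FILE = switch(1, FALSE, TRUE)
if (IS_LOCAL_FILE) {
    data.input <- read.csv(file = "allyears2k.csv", stringsAsFactors = FALSE)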
    allyears2k.hex <- as.h2o(data.input, destination_frame = "allyears2k.hex")
} else {
    airlinesPath <- "https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv"
    allyears2k.hex <- h2o.importFile(path = airlinesPath, destination_frame = "allyears2k.hex")
}

response <- "IsDepDelayed"
predictors <- setdiff(names(allyears2k.hex), response)

# Copied and pasted from the flow, then converting to R syntax
predictors.exc = c("DayofMonth", "DepTime", "CRSDepTime", "ArrTime", "CRSArrTime",
    "TailNum", "ActualElapsedTime", "CRSElapsedTime",
    "AirTime", "ArrDelay", "DepDelay", "TaxiIn", "TaxiOut",
    "Cancelled", "CancellationCode", "Diverted", "CarrierDelay",
    "WeatherDelay", "NASDelay", "SecurityDelay", "LateAircraftDelay",
    "IsArrDelayed")

predictors <- setdiff(predictors, predictors.exc)
# Convert to factor for classification
allyears2k.hex[, response] <- as.factor(allyears2k.hex[, response])

# Copied and pasted from the flow, then converting to R syntax
fit1 <- h2o.glm(
    x = predictors, y = response,
    model_id = "glm_model", seed = 123456, training_frame = allyears2k.hex,
    ignore_const_cols = TRUE,
    family = "binomial", solver = "IRLSM",
    alpha = 0.5, lambda = 0.00001, lambda_search = FALSE, standardize = TRUE,
    non_negative = FALSE, score_each_iteration = FALSE,
    max_iterations = -1, link = "family_default", intercept = TRUE,
    objective_epsilon = 0.00001, beta_epsilon = 0.0001,
    gradient_epsilon = 0.0001, prior = -1, max_active_predictors = -1
)
# Analysis
confMatrix <- h2o.confusionMatrix(fit1)
print("Confusion Matrix for training dataset")
print(confMatrix)
print(summary(fit1))
h2o.shutdown()

This is the confusion matrix for the training set:

Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
       NO   YES    Error          Rate
NO      0 20887 1.000000  =20887/20887
YES     0 23091 0.000000      =0/23091
Totals  0 43978 0.474942  =20887/43978

and the metrics:

H2OBinomialMetrics: glm
** Reported on training data. **

MSE:  0.2473858
RMSE:  0.4973789
LogLoss:  0.6878898
Mean Per-Class Error:  0.5
AUC:  0.5550138
Gini:  0.1100276
R^2:  0.007965165
Residual Deviance:  60504.04
AIC:  60516.04

In contrast, the h2o Flow result shows better performance:

[Image: training metrics for max F1 threshold]

and the confusion matrix for the max F1 threshold:

[Image: confusion matrix]

The h2o Flow performance is much better than running the same algorithm through the equivalent R package function.

Note: For simplicity I am using the Airlines Delay problem, which is a well-known h2o example, but I have seen the same kind of significant difference in other similar situations using the glm algorithm.

Any thoughts on why these significant differences occur?

Appendix A: Using default model parameters

Following the suggestion in @DarrenCook's answer, I used only the default model-building parameters, apart from the excluded columns and the seed:

h2o Flow

Now buildModel is invoked as follows:

buildModel 'glm', {"model_id":"glm_model-default",
  "seed":"123456","training_frame":"allyears2k.hex",
  "ignored_columns": 
     ["DayofMonth","DepTime","CRSDepTime","ArrTime","CRSArrTime","TailNum",
      "ActualElapsedTime","CRSElapsedTime","AirTime","ArrDelay","DepDelay",
      "TaxiIn","TaxiOut","Cancelled","CancellationCode","Diverted",
      "CarrierDelay","WeatherDelay","NASDelay","SecurityDelay",
      "LateAircraftDelay","IsArrDelayed"],
   "response_column":"IsDepDelayed","family":"binomial"
}

The result is:

[Image: ROC curve and parameters for max F1 criterion]

and the training metrics:

[Image: training metrics]

Running the R script

The following script makes it easy to switch to the default configuration (via the IS_DEFAULT_MODEL variable) while also keeping the configuration specified in the Airlines Delay example:

library(h2o)
h2o.init(max_mem_size = "12g", nthreads = -1) # nthreads = -1 uses all available cores

# switch(2, FALSE, TRUE) selects TRUE; use 1 to select FALSE
IS_LOCAL_FILE    = switch(2, FALSE, TRUE)
IS_DEFAULT_MODEL = switch(2, FALSE, TRUE)
if (IS_LOCAL_FILE) {
    data.input <- read.csv(file = "allyears2k.csv", stringsAsFactors = FALSE)
    allyears2k.hex <- as.h2o(data.input, destination_frame = "allyears2k.hex")
} else {
    airlinesPath <- "https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv"
    allyears2k.hex <- h2o.importFile(path = airlinesPath, destination_frame = "allyears2k.hex")
}

response <- "IsDepDelayed"
predictors <- setdiff(names(allyears2k.hex), response)

# Copied and pasted from the flow, then converted to R syntax
predictors.exc = c("DayofMonth", "DepTime", "CRSDepTime", "ArrTime", "CRSArrTime",
    "TailNum", "ActualElapsedTime", "CRSElapsedTime",
    "AirTime", "ArrDelay", "DepDelay", "TaxiIn", "TaxiOut",
    "Cancelled", "CancellationCode", "Diverted", "CarrierDelay",
    "WeatherDelay", "NASDelay", "SecurityDelay", "LateAircraftDelay",
    "IsArrDelayed")

predictors <- setdiff(predictors, predictors.exc)
# Convert to factor for classification
allyears2k.hex[, response] <- as.factor(allyears2k.hex[, response])

if (IS_DEFAULT_MODEL) {
    fit1 <- h2o.glm(
        x = predictors, model_id = "glm_model", seed = 123456,
        training_frame = allyears2k.hex, y = response, family = "binomial"
    )
} else { # Copied and pasted from the flow, then converted to R syntax
    fit1 <- h2o.glm(
        x = predictors,
        model_id = "glm_model", seed = 123456, training_frame = allyears2k.hex,
        ignore_const_cols = TRUE, y = response,
        family = "binomial", solver = "IRLSM",
        alpha = 0.5, lambda = 0.00001, lambda_search = FALSE, standardize = TRUE,
        non_negative = FALSE, score_each_iteration = FALSE,
        max_iterations = -1, link = "family_default", intercept = TRUE, objective_epsilon = 0.00001,
        beta_epsilon = 0.0001, gradient_epsilon = 0.0001, prior = -1, max_active_predictors = -1
    )
}

# Analysis
confMatrix <- h2o.confusionMatrix(fit1)
print("Confusion Matrix for training dataset")
print(confMatrix)
print(summary(fit1))
h2o.shutdown()

It produces the following results:

MSE:  0.2473859
RMSE:  0.497379
LogLoss:  0.6878898
Mean Per-Class Error:  0.5
AUC:  0.5549898
Gini:  0.1099796
R^2:  0.007964984
Residual Deviance:  60504.04
AIC:  60516.04

Confusion Matrix (vertical: actual; across: predicted) 
for F1-optimal threshold:
       NO   YES    Error          Rate
NO      0 20887 1.000000  =20887/20887
YES     0 23091 0.000000      =0/23091
Totals  0 43978 0.474942  =20887/43978

Some metrics are close, but the confusion matrix is very different: the R script predicts every flight as delayed.

Appendix B: Configuration

Note: I tested the R script under 3.19.0.4231, with the same result.

This is the cluster information after running R:

[cluster information output not preserved]

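2 answers:

Answer 0 (score: 2)

Troubleshooting tip: build the all-defaults model first:

mDef = h2o.glm(predictors, response, allyears2k.hex, family="binomial")

This takes 2 seconds and gives exactly the same AUC and confusion matrix as in your Flow screenshots.

So we now know that the problem you are seeing is caused by the model customization you have done...

...except that when I build your fit1, I get basically the same results as with my default model: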

         NO   YES    Error          Rate
NO     4276 16611 0.795279  =16611/20887
YES    1573 21518 0.068122   =1573/23091
Totals 5849 38129 0.413479  =18184/43978

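That was using your script exactly as given, so it fetched the remote csv file. (Oh, I removed the max_mem_size argument, as I don't have 12g on this notebook!)

Assuming you can reproduce your posted results exactly by running the code you posted (in a new R session, with a newly started H2O cluster), one possible explanation is that you are using 3.19.x, whereas the latest stable release is 3.18.0.2? (My test was with 3.14.0.1.)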
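To keep the version hypothesis concrete, the three version strings mentioned here can be compared numerically. This is only a sketch using base R's `compareVersion()` from `utils`, with the versions hard-coded from this thread; in a live session the installed package version could be read with `packageVersion("h2o")` instead, which assumes the package is installed.

```r
# Version strings quoted in this thread, hard-coded for illustration.
reported <- "3.19.0.4231"  # version used in the question
stable   <- "3.18.0.2"     # stable release named in this answer
tested   <- "3.14.0.1"     # version used for the test in this answer

# compareVersion() returns 1 if the first version is newer,
# -1 if it is older, and 0 if both are equal.
cmp_reported <- utils::compareVersion(reported, stable)
cmp_tested   <- utils::compareVersion(tested, stable)

cat("question vs stable:", cmp_reported, "\n")
cat("answer   vs stable:", cmp_tested, "\n")
```

A newer-than-stable build on one side and an older build on the other does not prove anything by itself, but it is a cheap thing to rule out before looking further.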

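Answer 1 (score: 0)

Finally, I think this is the explanation: both use the same parameter configuration to build the model (that is not the problem), but h2o Flow applies a specific parse customization that converts some variables to Enum, which the R script did not specify.

The Airlines Delay problem, as specified in the h2o Flow example, uses the following predictor variables (the flow defines the ignored_columns):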

"Year", "Month", "DayOfWeek", "UniqueCarrier", 
   "FlightNum", "Origin", "Dest", "Distance"

All of these predictors should be parsed as Enum, except Distance. The R script therefore needs to convert such columns from numeric or char to factor.
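Why the column type matters so much can be illustrated without H2O at all. The sketch below uses base R's `glm()` on synthetic data; the month effect is invented purely for illustration and is not taken from the airlines data. A code such as a month number carries no linear meaning, so a model that reads it as a number can only fit a single slope, while the factor version estimates one level per month.

```r
set.seed(42)
n <- 2000

# Hypothetical categorical predictor stored as integer codes 1..12.
month <- sample(1:12, n, replace = TRUE)

# Invented, non-monotonic effect: delays are more likely in months 1, 6, 7, 12.
p_delay <- ifelse(month %in% c(1, 6, 7, 12), 0.8, 0.3)
delayed <- rbinom(n, 1, p_delay)

# Same data, two encodings of the same column.
fit_numeric <- glm(delayed ~ month, family = binomial)          # one slope
fit_factor  <- glm(delayed ~ factor(month), family = binomial)  # one level per month

print(c(numeric = deviance(fit_numeric), factor = deviance(fit_factor)))
```

The factor encoding reaches a lower residual deviance because the numeric model is a restricted special case of it. The same mechanism is a plausible reason for a GLM on numerically parsed Year, Month, or FlightNum columns to collapse toward predicting a single class.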

Executing with the h2o R package

Here is the updated R script:

library(h2o)
h2o.init(max_mem_size = "12g", nthreads = -1) # nthreads = -1 uses all available cores

# switch(2, FALSE, TRUE) selects TRUE; use 1 to select FALSE
IS_LOCAL_FILE    = switch(2, FALSE, TRUE)
IS_DEFAULT_MODEL = switch(2, FALSE, TRUE)
if (IS_LOCAL_FILE) {
    data.input <- read.csv(file = "allyears2k.csv", stringsAsFactors = TRUE)
    allyears2k.hex <- as.h2o(data.input, destination_frame = "allyears2k.hex")
} else {
    airlinesPath <- "https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv"
    allyears2k.hex <- h2o.importFile(path = airlinesPath, destination_frame = "allyears2k.hex")
}

response <- "IsDepDelayed"
predictors <- setdiff(names(allyears2k.hex), response)

# Copied and pasted from the flow, then converting to R syntax
predictors.exc = c("DayofMonth", "DepTime", "CRSDepTime", 
    "ArrTime", "CRSArrTime",
    "TailNum", "ActualElapsedTime", "CRSElapsedTime",
    "AirTime", "ArrDelay", "DepDelay", "TaxiIn", "TaxiOut",
    "Cancelled", "CancellationCode", "Diverted", "CarrierDelay",
    "WeatherDelay", "NASDelay", "SecurityDelay", "LateAircraftDelay",
    "IsArrDelayed")

predictors <- setdiff(predictors, predictors.exc)
column.asFactor <- c("Year", "Month", "DayofMonth", "DayOfWeek", 
    "UniqueCarrier",  "FlightNum", "Origin", "Dest", response)
# Coercing as factor (equivalent to Enum from h2o Flow)
# Note: Using lapply does not work, see the answer of this question
# https://stackoverflow.com/questions/49393343/how-to-coerce-multiple-columns-to-factors-at-once-for-h2oframe-object
for (col in column.asFactor) {
    allyears2k.hex[col] <- as.factor(allyears2k.hex[col])
}

if (IS_DEFAULT_MODEL) {
    fit1 <- h2o.glm(x = predictors, y = response, 
       training_frame = allyears2k.hex,
       family = "binomial", seed = 123456
    )
} else { # Copied and pasted from the flow, then converting to R syntax
    fit1 <- h2o.glm(
        x = predictors, y = response,
        model_id = "glm_model", seed = 123456,
        training_frame = allyears2k.hex,
        ignore_const_cols = TRUE,
        family = "binomial", solver = "IRLSM",
        alpha = 0.5, lambda = 0.00001, lambda_search = FALSE, standardize = TRUE,
        non_negative = FALSE, score_each_iteration = FALSE,
        max_iterations = -1, link = "family_default", intercept = TRUE,
        objective_epsilon = 0.00001,
        beta_epsilon = 0.0001, gradient_epsilon = 0.0001, prior = -1,
        max_active_predictors = -1
    )
}

# Analysis
print("Confusion Matrix for training dataset")
confMatrix <- h2o.confusionMatrix(fit1)
print(confMatrix)
print(summary(fit1))
h2o.shutdown()

This is the result of running the R script with the default configuration (IS_DEFAULT_MODEL = TRUE):

H2OBinomialMetrics: glm
** Reported on training data. **

MSE:                   0.2001145
RMSE:                  0.4473416
LogLoss:               0.5845852
Mean Per-Class Error:  0.3343562
AUC:                   0.7570867
Gini:                  0.5141734
R^2:                   0.1975266
Residual Deviance:     51417.77
AIC:                   52951.77

Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
          NO   YES    Error          Rate
NO     10337 10550 0.505099  =10550/20887
YES     3778 19313 0.163614   =3778/23091
Totals 14115 29863 0.325799  =14328/43978
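The error columns in this output can be recomputed from the four counts, which is a quick sanity check when comparing runs. A small base R sketch, with the counts copied from the matrix above:

```r
# Counts copied from the confusion matrix above
# (rows: actual class, columns: predicted class).
cm <- matrix(c(10337, 10550,
                3778, 19313),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("NO", "YES"),
                             predicted = c("NO", "YES")))

# Per-class error: share of each actual class that was misclassified.
class_error <- 1 - diag(cm) / rowSums(cm)

# Overall error and the unweighted mean of the per-class errors.
total_error <- 1 - sum(diag(cm)) / sum(cm)
mean_per_class_error <- mean(class_error)

print(round(class_error, 6))
print(round(c(total = total_error, mean_per_class = mean_per_class_error), 6))
```

For this run the recomputed mean of the two per-class errors agrees with the reported Mean Per-Class Error to the printed precision.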

Executing under h2o Flow

Now, executing the flow Airlines_Delay_GLMFixedSeed, we obtain the same results. Here are the details of the flow configuration:

The parseFiles function:

parseFiles
  paths: ["https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv"]
  destination_frame: "allyears2k.hex"
  parse_type: "CSV"
  separator: 44
  number_columns: 31
  single_quotes: false
  column_names: 
  ["Year","Month","DayofMonth","DayOfWeek","DepTime","CRSDepTime","ArrTime",
   "CRSArrTime","UniqueCarrier","FlightNum","TailNum","ActualElapsedTime",
   "CRSElapsedTime","AirTime","ArrDelay","DepDelay","Origin","Dest",
   "Distance","TaxiIn","TaxiOut","Cancelled","CancellationCode",
   "Diverted","CarrierDelay","WeatherDelay","NASDelay","SecurityDelay",
   "LateAircraftDelay","IsArrDelayed",
   "IsDepDelayed"]
  column_types: ["Enum","Enum","Enum","Enum","Numeric","Numeric",
   "Numeric","Numeric", "Enum","Enum","Enum","Numeric",
   "Numeric", "Numeric","Numeric","Numeric",
   "Enum","Enum","Numeric","Numeric","Numeric",
   "Enum","Enum","Numeric","Numeric","Numeric",
   "Numeric","Numeric","Numeric","Enum","Enum"]
  delete_on_done: true
  check_header: 1
  chunk_size: 4194304

This converts the following predictor columns to Enum: "Year", "Month", "DayOfWeek", "UniqueCarrier", "FlightNum", "Origin", "Dest"
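The list of Enum predictors can be derived mechanically from the parseFiles settings rather than read off by eye. The following base R sketch copies column_names, column_types and the ignored columns from the flow; the variable names are my own and only for illustration.

```r
# Column names and types copied from the parseFiles call above.
column_names <- c("Year", "Month", "DayofMonth", "DayOfWeek", "DepTime",
  "CRSDepTime", "ArrTime", "CRSArrTime", "UniqueCarrier", "FlightNum",
  "TailNum", "ActualElapsedTime", "CRSElapsedTime", "AirTime", "ArrDelay",
  "DepDelay", "Origin", "Dest", "Distance", "TaxiIn", "TaxiOut", "Cancelled",
  "CancellationCode", "Diverted", "CarrierDelay", "WeatherDelay", "NASDelay",
  "SecurityDelay", "LateAircraftDelay", "IsArrDelayed", "IsDepDelayed")
column_types <- c("Enum", "Enum", "Enum", "Enum", "Numeric", "Numeric",
  "Numeric", "Numeric", "Enum", "Enum", "Enum", "Numeric",
  "Numeric", "Numeric", "Numeric", "Numeric",
  "Enum", "Enum", "Numeric", "Numeric", "Numeric",
  "Enum", "Enum", "Numeric", "Numeric", "Numeric",
  "Numeric", "Numeric", "Numeric", "Enum", "Enum")

# Columns the flow ignores, plus the response column.
ignored <- c("DayofMonth", "DepTime", "CRSDepTime", "ArrTime", "CRSArrTime",
  "TailNum", "ActualElapsedTime", "CRSElapsedTime", "AirTime", "ArrDelay",
  "DepDelay", "TaxiIn", "TaxiOut", "Cancelled", "CancellationCode",
  "Diverted", "CarrierDelay", "WeatherDelay", "NASDelay", "SecurityDelay",
  "LateAircraftDelay", "IsArrDelayed")
response <- "IsDepDelayed"

stopifnot(length(column_names) == length(column_types))

# Look up the parse type of each remaining predictor by name.
types <- setNames(column_types, column_names)
predictors <- setdiff(column_names, c(ignored, response))
enum_predictors    <- predictors[types[predictors] == "Enum"]
numeric_predictors <- predictors[types[predictors] == "Numeric"]

print(enum_predictors)
print(numeric_predictors)
```

Distance is the only predictor left as Numeric, so every other predictor has to be coerced to a factor on the R side to match the flow.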

Now the buildModel function is invoked with the default parameters, except for ignored_columns and seed, as follows:

 buildModel 'glm', {"model_id":"glm_model-default","seed":"123456",
  "training_frame":"allyears2k.hex",
  "ignored_columns":["DayofMonth","DepTime","CRSDepTime","ArrTime",
  "CRSArrTime","TailNum",
  "ActualElapsedTime","CRSElapsedTime","AirTime","ArrDelay","DepDelay",
  "TaxiIn","TaxiOut","Cancelled","CancellationCode","Diverted",
  "CarrierDelay","WeatherDelay","NASDelay","SecurityDelay",
  "LateAircraftDelay","IsArrDelayed"],"response_column":"IsDepDelayed",
  "family":"binomial"}

Finally, we get the following result:

[Image: confusion matrix for max F1 threshold]

and the training output metrics:

model                   glm_model-default
model_checksum          -2438376548367921152
frame                   allyears2k.hex
frame_checksum          -2331137066674151424
description             ·
model_category          Binomial
scoring_time            1521598137667
predictions             ·
MSE                     0.200114
RMSE                    0.447342
nobs                    43978
custom_metric_name      ·
custom_metric_value     0
r2                      0.197527
logloss                 0.584585
AUC                     0.757084
Gini                    0.514168
mean_per_class_error    0.334347
residual_deviance       51417.772427
null_deviance           60855.951538
AIC                     52951.772427
null_degrees_of_freedom 43977
residual_degrees_of_freedom 43211

Comparing both results

The training metrics are almost identical to the first four significant digits:

                       R-Script   H2o Flow
MSE:                   0.2001145  0.200114
RMSE:                  0.4473416  0.447342
LogLoss:               0.5845852  0.584585
Mean Per-Class Error:  0.3343562  0.334347
AUC:                   0.7570867  0.757084
Gini:                  0.5141734  0.514168
R^2:                   0.1975266  0.197527
Residual Deviance:     51417.77   51417.772427
AIC:                   52951.77   52951.772427

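The confusion matrices differ slightly (counts taking YES as the positive class):

          TN     TP    FP    FN   
R-Script  10337  19313 10550 3778
H2o Flow  10341  19309 10546 3782

          Error
R-Script  0.325799  
H2o Flow  0.3258

My understanding is that the difference is within an acceptable tolerance (around 0.0001), so we can say that both interfaces give the same result.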
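The tolerance claim can be checked directly. In this base R sketch the values are copied from the comparison table above, and the 1e-4 threshold is applied to the relative difference, which is my own reading of "around 0.0001" (the deviance and AIC values are too large for an absolute threshold of that size to make sense).

```r
# Metric values copied from the comparison table above.
metrics <- data.frame(
  metric   = c("MSE", "RMSE", "LogLoss", "Mean Per-Class Error", "AUC",
               "Gini", "R^2", "Residual Deviance", "AIC"),
  r_script = c(0.2001145, 0.4473416, 0.5845852, 0.3343562, 0.7570867,
               0.5141734, 0.1975266, 51417.77, 52951.77),
  h2o_flow = c(0.200114, 0.447342, 0.584585, 0.334347, 0.757084,
               0.514168, 0.197527, 51417.772427, 52951.772427)
)

# Relative difference per metric, compared against an assumed tolerance.
tolerance <- 1e-4
metrics$rel_diff   <- abs(metrics$r_script - metrics$h2o_flow) / abs(metrics$h2o_flow)
metrics$within_tol <- metrics$rel_diff < tolerance
print(metrics)

# Overall error from the confusion matrix counts: (FP + FN) / total.
errors <- c(r_script = (10550 + 3778) / 43978,
            h2o_flow = (10546 + 3782) / 43978)
print(errors)
```

Every metric falls within the tolerance, and although the individual cells differ by four flights each, both confusion matrices contain the same total number of misclassified rows, so the overall error is identical.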