The following script reproduces a problem equivalent to the one described in the h2o Flow help (Help -> View Example Flow, or Help -> Browse Installed packs.. -> examples -> Airlines Delay.flow, download), but using the h2o R package and a fixed seed (123456):
library(h2o)
# Use all available cores (nthreads = -1)
h2o.init(max_mem_size = "12g", nthreads = -1)
IS_LOCAL_FILE = switch(1, FALSE, TRUE)
if (IS_LOCAL_FILE) {
data.input <- read.csv(file = "allyears2k.csv", stringsAsFactors = F)
allyears2k.hex <- as.h2o(data.input, destination_frame = "allyears2k.hex")
} else {
airlinesPath <- "https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv"
allyears2k.hex <- h2o.importFile(path = airlinesPath, destination_frame = "allyears2k.hex")
}
response <- "IsDepDelayed"
predictors <- setdiff(names(allyears2k.hex), response)
# Copied and pasted from the flow, then converting to R syntax
predictors.exc = c("DayofMonth", "DepTime", "CRSDepTime", "ArrTime", "CRSArrTime",
"TailNum", "ActualElapsedTime", "CRSElapsedTime",
"AirTime", "ArrDelay", "DepDelay", "TaxiIn", "TaxiOut",
"Cancelled", "CancellationCode", "Diverted", "CarrierDelay",
"WeatherDelay", "NASDelay", "SecurityDelay", "LateAircraftDelay",
"IsArrDelayed")
predictors <- setdiff(predictors, predictors.exc)
# Convert to factor for classification
allyears2k.hex[, response] <- as.factor(allyears2k.hex[, response])
# Copied and pasted from the flow, then converting to R syntax
fit1 <- h2o.glm(
x = predictors,
model_id="glm_model", seed=123456, training_frame=allyears2k.hex,
ignore_const_cols = T, y = response,
family="binomial", solver="IRLSM",
alpha=0.5,lambda=0.00001, lambda_search=F, standardize=T,
non_negative=F, score_each_iteration=F,
max_iterations=-1, link="family_default", intercept=T, objective_epsilon=0.00001,
beta_epsilon=0.0001, gradient_epsilon=0.0001, prior=-1, max_active_predictors=-1
)
# Analysis
confMatrix <- h2o.confusionMatrix(fit1)
print("Confusion Matrix for training dataset")
print(confMatrix)
print(summary(fit1))
h2o.shutdown()
This produces the following confusion matrix and metrics on the training set:
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
NO YES Error Rate
NO 0 20887 1.000000 =20887/20887
YES 0 23091 0.000000 =0/23091
Totals 0 43978 0.474942 =20887/43978
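As a quick sanity check, the Error and Rate columns above can be recomputed from the four cell counts in base R. This is only an illustrative sketch: the variable names are mine and nothing here calls h2o.

```r
# Confusion matrix copied from the output above
# (rows: actual class, columns: predicted class)
cm <- matrix(c(0, 20887,
               0, 23091),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("NO", "YES"),
                             predicted = c("NO", "YES")))

# Per-class error: share of each actual class that was misclassified
per_class_error <- 1 - diag(cm) / rowSums(cm)

# Overall error rate and mean per-class error
total_error <- 1 - sum(diag(cm)) / sum(cm)
mean_per_class_error <- mean(per_class_error)

print(per_class_error)
print(round(total_error, 6))
print(mean_per_class_error)
```

The overall error equals the share of NO rows in the data, so at this threshold the model does no better than always predicting YES, which is what a mean per-class error of 0.5 indicates.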
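H2OBinomialMetrics: glm
** Reported on training data. **
MSE: 0.2473858
RMSE: 0.4973789
LogLoss: 0.6878898
Mean Per-Class Error: 0.5
AUC: 0.5550138
Gini: 0.1100276
R^2: 0.007965165
Residual Deviance: 60504.04
AIC: 60516.04
By contrast, h2o Flow performs much better than the same algorithm run through the equivalent h2o R package function.
Note: for simplicity I am using the Airlines Delay problem, a well-known h2o example, but I have seen similarly significant differences in other comparable cases using the glm algorithm.
Any ideas about why these significant differences occur?
Appendix A: Using default model parameters
Following the suggestion in @DarrenCook's answer, I used only the default model-building parameters, apart from the excluded columns and the seed.
h2o Flow
Now buildModel is invoked as follows:
buildModel 'glm', {"model_id":"glm_model-default",
"seed":"123456","training_frame":"allyears2k.hex",
"ignored_columns":
["DayofMonth","DepTime","CRSDepTime","ArrTime","CRSArrTime","TailNum",
"ActualElapsedTime","CRSElapsedTime","AirTime","ArrDelay","DepDelay",
"TaxiIn","TaxiOut","Cancelled","CancellationCode","Diverted",
"CarrierDelay","WeatherDelay","NASDelay","SecurityDelay",
"LateAircraftDelay","IsArrDelayed"],
"response_column":"IsDepDelayed","family":"binomial"
}
The resulting confusion matrix and training metrics were posted as Flow screenshots.
Running the R script
The following script makes it easy to switch to the default configuration (via the IS_DEFAULT_MODEL variable) while keeping the configuration used in the Airlines Delay example: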
library(h2o)
h2o.init(max_mem_size = "12g", nthreads = -1) # Use all available cores
IS_LOCAL_FILE = switch(2, FALSE, TRUE)
IS_DEFAULT_MODEL = switch(2, FALSE, TRUE)
if (IS_LOCAL_FILE) {
data.input <- read.csv(file = "allyears2k.csv", stringsAsFactors = F)
allyears2k.hex <- as.h2o(data.input, destination_frame = "allyears2k.hex")
} else {
airlinesPath <- "https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv"
allyears2k.hex <- h2o.importFile(path = airlinesPath, destination_frame = "allyears2k.hex")
}
response <- "IsDepDelayed"
predictors <- setdiff(names(allyears2k.hex), response)
# Copied and pasted from the flow, then converted to R syntax
predictors.exc = c("DayofMonth", "DepTime", "CRSDepTime", "ArrTime", "CRSArrTime",
"TailNum", "ActualElapsedTime", "CRSElapsedTime",
"AirTime", "ArrDelay", "DepDelay", "TaxiIn", "TaxiOut",
"Cancelled", "CancellationCode", "Diverted", "CarrierDelay",
"WeatherDelay", "NASDelay", "SecurityDelay", "LateAircraftDelay",
"IsArrDelayed")
predictors <- setdiff(predictors, predictors.exc)
# Convert to factor for classification
allyears2k.hex[, response] <- as.factor(allyears2k.hex[, response])
if (IS_DEFAULT_MODEL) {
fit1 <- h2o.glm(
x = predictors, model_id = "glm_model", seed = 123456,
training_frame = allyears2k.hex, y = response, family = "binomial"
)
} else { # Copied and pasted from the flow, then converted to R syntax
fit1 <- h2o.glm(
x = predictors,
model_id = "glm_model", seed = 123456, training_frame = allyears2k.hex,
ignore_const_cols = T, y = response,
family = "binomial", solver = "IRLSM",
alpha = 0.5, lambda = 0.00001, lambda_search = F, standardize = T,
non_negative = F, score_each_iteration = F,
max_iterations = -1, link = "family_default", intercept = T, objective_epsilon = 0.00001,
beta_epsilon = 0.0001, gradient_epsilon = 0.0001, prior = -1, max_active_predictors = -1
)
}
# Analysis
confMatrix <- h2o.confusionMatrix(fit1)
print("Confusion Matrix for training dataset")
print(confMatrix)
print(summary(fit1))
h2o.shutdown()
It produces the following results:
MSE: 0.2473859
RMSE: 0.497379
LogLoss: 0.6878898
Mean Per-Class Error: 0.5
AUC: 0.5549898
Gini: 0.1099796
R^2: 0.007964984
Residual Deviance: 60504.04
AIC: 60516.04
Confusion Matrix (vertical: actual; across: predicted)
for F1-optimal threshold:
NO YES Error Rate
NO 0 20887 1.000000 =20887/20887
YES 0 23091 0.000000 =0/23091
Totals 0 43978 0.474942 =20887/43978
Some metrics are close, but the confusion matrix is very different: the R script predicts that every flight is delayed.
Appendix B: Configuration
Note: I tested the R script under 3.19.0.4231 and got the same results.
This is the cluster information after running R:
Answer 0 (score: 2)
Troubleshooting tip: build an all-defaults model first:
mDef = h2o.glm(predictors, response, allyears2k.hex, family="binomial")
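That takes 2 seconds and gives exactly the same AUC and confusion matrix as in your Flow screenshots.
So we now know that the problem you are seeing is due to all the model customization you have done...
...except that when I build your fit1, I get basically the same results as with my default model: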
NO YES Error Rate
NO 4276 16611 0.795279 =16611/20887
YES 1573 21518 0.068122 =1573/23091
Totals 5849 38129 0.413479 =18184/43978
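This was using your script exactly as given, so it fetched the remote csv file. (Oh, I removed the max_mem_size argument, as I don't have 12g on this notebook!)
Assuming you can reproduce your posted results exactly by running the code you posted (in a new R session, with a newly started H2O cluster), one possible explanation is that you are using 3.19.x, whereas the latest stable release is 3.18.0.2? (I tested with 3.14.0.1.)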
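To make the version point concrete, here is a small base R sketch that orders the three version strings mentioned in this thread with utils::compareVersion. The variable names are mine; in a live session the installed version could be read with as.character(packageVersion("h2o")) instead of being typed in by hand.

```r
# Version strings quoted in this thread
asker_version  <- "3.19.0.4231"  # version the question was tested under
stable_version <- "3.18.0.2"     # latest stable release named in this answer
tester_version <- "3.14.0.1"     # version used for the test in this answer

# compareVersion() returns 1 if the first version is later,
# 0 if both are equal and -1 if the first is earlier
cmp_asker  <- compareVersion(asker_version, stable_version)
cmp_tester <- compareVersion(tester_version, stable_version)

print(cmp_asker)
print(cmp_tester)
```

A positive value for the asker's version means it is ahead of the stable release, so it is presumably a development build, while the test in this answer ran on an older stable release.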
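Answer 1 (score: 0)
Finally, I think this is the explanation: both interfaces use the same parameter configuration to build the model (that is not the problem), but h2o Flow applies a specific parse customization that converts some variables to Enum, which the R script did not specify.
As specified in the h2o Flow example, the Airlines Delay problem uses the following predictors (the flow defines ignored_columns):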
"Year", "Month", "DayOfWeek", "UniqueCarrier",
"FlightNum", "Origin", "Dest", "Distance"
All of these predictors should be parsed as Enum, except Distance. The R script therefore needs to convert those columns from numeric or char to factor.
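Why the column type matters so much for a GLM can be illustrated with a base R analogue. The sketch below uses stats::model.matrix on a toy Month column; it is not H2O's internal encoding, which differs in detail, and the toy data frame is made up for illustration.

```r
# Toy data: a month column stored as integers, as a plain numeric parse would give
toy <- data.frame(Month = rep(1:12, times = 2))

# Numeric: the model gets a single slope shared by all months
X_numeric <- model.matrix(~ Month, data = toy)

# Factor: the model gets one indicator column per non-reference level
X_factor <- model.matrix(~ factor(Month), data = toy)

print(ncol(X_numeric))  # intercept + 1 slope
print(ncol(X_factor))   # intercept + 11 level indicators
```

With a numeric Month the model can only fit a straight-line trend across months, whereas the factor version can fit a separate effect per month. The same applies to columns such as UniqueCarrier, Origin or FlightNum, whose numeric codes carry no ordering at all.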
Execution with the h2o R package
Here is the updated R script:
library(h2o)
h2o.init(max_mem_size = "12g", nthreads = -1) # Use all available cores
IS_LOCAL_FILE = switch(2, FALSE, TRUE)
IS_DEFAULT_MODEL = switch(2, FALSE, TRUE)
if (IS_LOCAL_FILE) {
data.input <- read.csv(file = "allyears2k.csv", stringsAsFactors = T)
allyears2k.hex <- as.h2o(data.input, destination_frame = "allyears2k.hex")
} else {
airlinesPath <- "https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv"
allyears2k.hex <- h2o.importFile(path = airlinesPath, destination_frame = "allyears2k.hex")
}
response <- "IsDepDelayed"
predictors <- setdiff(names(allyears2k.hex), response)
# Copied and pasted from the flow, then converting to R syntax
predictors.exc = c("DayofMonth", "DepTime", "CRSDepTime",
"ArrTime", "CRSArrTime",
"TailNum", "ActualElapsedTime", "CRSElapsedTime",
"AirTime", "ArrDelay", "DepDelay", "TaxiIn", "TaxiOut",
"Cancelled", "CancellationCode", "Diverted", "CarrierDelay",
"WeatherDelay", "NASDelay", "SecurityDelay", "LateAircraftDelay",
"IsArrDelayed")
predictors <- setdiff(predictors, predictors.exc)
column.asFactor <- c("Year", "Month", "DayofMonth", "DayOfWeek",
"UniqueCarrier", "FlightNum", "Origin", "Dest", response)
# Coercing as factor (equivalent to Enum from h2o Flow)
# Note: Using lapply does not work, see the answer of this question
# https://stackoverflow.com/questions/49393343/how-to-coerce-multiple-columns-to-factors-at-once-for-h2oframe-object
for (col in column.asFactor) {
allyears2k.hex[col] <- as.factor(allyears2k.hex[col])
}
if (IS_DEFAULT_MODEL) {
fit1 <- h2o.glm(x = predictors, y = response,
training_frame = allyears2k.hex,
family = "binomial", seed = 123456
)
} else { # Copied and pasted from the flow, then converting to R syntax
fit1 <- h2o.glm(
x = predictors,
model_id = "glm_model", seed = 123456,
training_frame = allyears2k.hex,
ignore_const_cols = T, y = response,
family = "binomial", solver = "IRLSM",
alpha = 0.5, lambda = 0.00001, lambda_search = F, standardize = T,
non_negative = F, score_each_iteration = F,
max_iterations = -1, link = "family_default", intercept = T,
objective_epsilon = 0.00001,
beta_epsilon = 0.0001, gradient_epsilon = 0.0001, prior = -1,
max_active_predictors = -1
)
}
# Analysis
print("Confusion Matrix for training dataset")
confMatrix <- h2o.confusionMatrix(fit1)
print(confMatrix)
print(summary(fit1))
h2o.shutdown()
This is the result of running the R script with the default configuration (IS_DEFAULT_MODEL = T):
H2OBinomialMetrics: glm
** Reported on training data. **
MSE: 0.2001145
RMSE: 0.4473416
LogLoss: 0.5845852
Mean Per-Class Error: 0.3343562
AUC: 0.7570867
Gini: 0.5141734
R^2: 0.1975266
Residual Deviance: 51417.77
AIC: 52951.77
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
NO YES Error Rate
NO 10337 10550 0.505099 =10550/20887
YES 3778 19313 0.163614 =3778/23091
Totals 14115 29863 0.325799 =14328/43978
Execution in h2o Flow
Now, executing the flow Airlines_Delay_GLMFixedSeed, we obtain the same results. Here are the details of the flow configuration.
The parseFiles function:
parseFiles
paths: ["https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv"]
destination_frame: "allyears2k.hex"
parse_type: "CSV"
separator: 44
number_columns: 31
single_quotes: false
column_names:
["Year","Month","DayofMonth","DayOfWeek","DepTime","CRSDepTime","ArrTime",
"CRSArrTime","UniqueCarrier","FlightNum","TailNum","ActualElapsedTime",
"CRSElapsedTime","AirTime","ArrDelay","DepDelay","Origin","Dest",
"Distance","TaxiIn","TaxiOut","Cancelled","CancellationCode",
"Diverted","CarrierDelay","WeatherDelay","NASDelay","SecurityDelay",
"LateAircraftDelay","IsArrDelayed",
"IsDepDelayed"]
column_types: ["Enum","Enum","Enum","Enum","Numeric","Numeric",
"Numeric","Numeric", "Enum","Enum","Enum","Numeric",
"Numeric", "Numeric","Numeric","Numeric",
"Enum","Enum","Numeric","Numeric","Numeric",
"Enum","Enum","Numeric","Numeric","Numeric",
"Numeric","Numeric","Numeric","Enum","Enum"]
delete_on_done: true
check_header: 1
chunk_size: 4194304
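The column_names and column_types arrays above pair up by position, so the set of Enum predictors can be derived rather than read off by eye. A minimal base R sketch, with variable names of my own choosing, copying the two arrays and the ignored_columns list from this thread:

```r
column_names <- c("Year", "Month", "DayofMonth", "DayOfWeek", "DepTime", "CRSDepTime",
                  "ArrTime", "CRSArrTime", "UniqueCarrier", "FlightNum", "TailNum",
                  "ActualElapsedTime", "CRSElapsedTime", "AirTime", "ArrDelay",
                  "DepDelay", "Origin", "Dest", "Distance", "TaxiIn", "TaxiOut",
                  "Cancelled", "CancellationCode", "Diverted", "CarrierDelay",
                  "WeatherDelay", "NASDelay", "SecurityDelay", "LateAircraftDelay",
                  "IsArrDelayed", "IsDepDelayed")
column_types <- c("Enum", "Enum", "Enum", "Enum", "Numeric", "Numeric",
                  "Numeric", "Numeric", "Enum", "Enum", "Enum", "Numeric",
                  "Numeric", "Numeric", "Numeric", "Numeric",
                  "Enum", "Enum", "Numeric", "Numeric", "Numeric",
                  "Enum", "Enum", "Numeric", "Numeric", "Numeric",
                  "Numeric", "Numeric", "Numeric", "Enum", "Enum")
stopifnot(length(column_names) == length(column_types))
names(column_types) <- column_names

# Columns the flow ignores, plus the response
ignored <- c("DayofMonth", "DepTime", "CRSDepTime", "ArrTime", "CRSArrTime",
             "TailNum", "ActualElapsedTime", "CRSElapsedTime", "AirTime",
             "ArrDelay", "DepDelay", "TaxiIn", "TaxiOut", "Cancelled",
             "CancellationCode", "Diverted", "CarrierDelay", "WeatherDelay",
             "NASDelay", "SecurityDelay", "LateAircraftDelay", "IsArrDelayed")
response <- "IsDepDelayed"

# Predictors actually used by the model, and those parsed as Enum
predictors <- setdiff(column_names, c(ignored, response))
enum_predictors <- predictors[column_types[predictors] == "Enum"]

print(predictors)
print(enum_predictors)
```

I believe h2o.importFile also accepts a col.types argument that could take a vector like column_types, which would reproduce the Flow parse setup at import time instead of coercing afterwards, but that is an assumption to check against ?h2o.importFile for your version.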
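Among the predictors, the following columns are parsed as Enum: "Year", "Month", "DayOfWeek", "UniqueCarrier", "FlightNum", "Origin", "Dest"
Now the buildModel function is invoked with default parameters, except for ignored_columns and seed, as follows: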
buildModel 'glm', {"model_id":"glm_model-default","seed":"123456",
"training_frame":"allyears2k.hex",
"ignored_columns":["DayofMonth","DepTime","CRSDepTime","ArrTime",
"CRSArrTime","TailNum",
"ActualElapsedTime","CRSElapsedTime","AirTime","ArrDelay","DepDelay",
"TaxiIn","TaxiOut","Cancelled","CancellationCode","Diverted",
"CarrierDelay","WeatherDelay","NASDelay","SecurityDelay",
"LateAircraftDelay","IsArrDelayed"],"response_column":"IsDepDelayed",
"family":"binomial"}
Finally, we obtain the following training output metrics:
model glm_model-default
model_checksum -2438376548367921152
frame allyears2k.hex
frame_checksum -2331137066674151424
description ·
model_category Binomial
scoring_time 1521598137667
predictions ·
MSE 0.200114
RMSE 0.447342
nobs 43978
custom_metric_name ·
custom_metric_value 0
r2 0.197527
logloss 0.584585
AUC 0.757084
Gini 0.514168
mean_per_class_error 0.334347
residual_deviance 51417.772427
null_deviance 60855.951538
AIC 52951.772427
null_degrees_of_freedom 43977
residual_degrees_of_freedom 43211
Comparing both results
The training metrics agree to roughly the first 4 significant digits:
R-Script H2o Flow
MSE: 0.2001145 0.200114
RMSE: 0.4473416 0.447342
LogLoss: 0.5845852 0.584585
Mean Per-Class Error: 0.3343562 0.334347
AUC: 0.7570867 0.757084
Gini: 0.5141734 0.514168
R^2: 0.1975266 0.197527
Residual Deviance: 51417.77 51417.772427
AIC: 52951.77 52951.772427
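The agreement in the table above can be quantified as a relative difference per metric. The following base R sketch copies the two columns as printed; the vector names are mine.

```r
metric <- c("MSE", "RMSE", "LogLoss", "MeanPerClassError", "AUC",
            "Gini", "R2", "ResidualDeviance", "AIC")
r_script <- c(0.2001145, 0.4473416, 0.5845852, 0.3343562, 0.7570867,
              0.5141734, 0.1975266, 51417.77, 52951.77)
h2o_flow <- c(0.200114, 0.447342, 0.584585, 0.334347, 0.757084,
              0.514168, 0.197527, 51417.772427, 52951.772427)

# Relative difference of each metric, taking the Flow value as reference
rel_diff <- abs(r_script - h2o_flow) / abs(h2o_flow)
names(rel_diff) <- metric

print(signif(rel_diff, 2))
print(names(which.max(rel_diff)))
print(max(rel_diff) < 1e-4)
```

Part of each difference is just rounding, since the two interfaces print a different number of digits, so these figures are an upper bound on the real discrepancy.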
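The confusion matrices differ slightly (counting YES as the positive class):
          TN     TP     FP     FN
R-Script  10337  19313  10550  3778
H2o Flow  10341  19309  10546  3782
          Error
R-Script  0.325799
H2o Flow  0.3258
My understanding is that the difference is within an acceptable tolerance (around 0.0001), so we can say that both interfaces give the same result.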
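The size of the remaining gap can be checked directly from the cell counts. This base R sketch assumes the four numbers per row are, in order, actual NO predicted NO, actual YES predicted YES, actual NO predicted YES, and actual YES predicted NO, as in the R script's confusion matrix; that the Flow row uses the same order is my assumption.

```r
n <- 43978  # number of training rows

# Cell counts named as actual_predicted
r_script <- c(NO_NO = 10337, YES_YES = 19313, NO_YES = 10550, YES_NO = 3778)
h2o_flow <- c(NO_NO = 10341, YES_YES = 19309, NO_YES = 10546, YES_NO = 3782)
stopifnot(sum(r_script) == n, sum(h2o_flow) == n)

# Total misclassified rows and overall error rate for each interface
errors_r    <- r_script[["NO_YES"]] + r_script[["YES_NO"]]
errors_flow <- h2o_flow[["NO_YES"]] + h2o_flow[["YES_NO"]]
error_rate_r    <- errors_r / n
error_rate_flow <- errors_flow / n

# Largest shift in any single cell between the two runs
max_cell_shift <- max(abs(r_script - h2o_flow))

print(c(errors_r, errors_flow))
print(round(c(error_rate_r, error_rate_flow), 6))
print(max_cell_shift)
```

Under that reading both runs misclassify exactly the same number of rows, and only 4 rows move between cells, which would be consistent with a marginally different F1-optimal threshold rather than a different model.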