I'm using h2o.randomForest in R to build a classifier on two groups, group "A" and group "B". As an example, I randomly generated a sample dataset as follows and converted it to an H2OFrame:
a <- sample(0:1,10000,replace=T)
b <- sample(0:1,10000,replace=T)
c <- sample(1:10,10000,replace=T)
d <- sample(0:1,10000,replace=T)
e <- sample(0:1,10000,replace=T)
f <- sample(0:1,10000,replace=T)
Essentially these will all be treated as factors with 2 levels each, except for c, which has 10 levels. The first 5000 rows are assigned the label "A" and the rest are assigned the label "B". In addition, I have another column named nlabel, where the first 5000 rows are "B" and the rest are "A".
Here are the first 10 and last 10 rows of my dataset:
a b c d e f label nlabel
1 0 0 5 0 1 0 A B
2 0 1 5 1 1 1 A B
3 0 0 6 0 0 1 A B
4 0 0 8 0 0 1 A B
5 1 1 1 1 1 1 A B
6 1 1 6 1 0 1 A B
7 1 0 3 1 1 1 A B
8 1 1 9 1 0 1 A B
9 1 0 8 1 0 1 A B
10 0 0 1 0 1 1 A B
.............
9991 1 1 3 0 0 1 B A
9992 0 0 7 1 0 0 B A
9993 1 0 9 0 1 1 B A
9994 0 1 3 0 0 0 B A
9995 1 1 8 0 1 0 B A
9996 0 1 8 0 1 0 B A
9997 1 1 9 0 1 0 B A
9998 0 0 5 1 0 1 B A
9999 0 1 9 1 1 0 B A
10000 0 1 10 1 0 1 B A
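For reference, the setup described above can be reproduced in plain R like this (a minimal sketch; the data frame name `test` is assumed to match the modeling code below, and `nlabel` is simply `label` reversed):

```r
set.seed(1)  # for reproducibility; the original question did not set a seed
a <- sample(0:1, 10000, replace = TRUE)
b <- sample(0:1, 10000, replace = TRUE)
c <- sample(1:10, 10000, replace = TRUE)
d <- sample(0:1, 10000, replace = TRUE)
e <- sample(0:1, 10000, replace = TRUE)
f <- sample(0:1, 10000, replace = TRUE)

# First 5000 rows labeled "A", the rest "B"; nlabel is the reverse
label  <- factor(rep(c("A", "B"), each = 5000))
nlabel <- factor(rep(c("B", "A"), each = 5000))

test <- data.frame(a, b, c, d, e, f, label, nlabel)
```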
Since I generated the dataset randomly, I should not be able to get a good classifier (unless I'm the luckiest person in the world); I expect nothing better than random guessing. Here is what I got using the "randomForest" package in R:
> rf <- randomForest(label ~ a + b + c + e + f,
+                    data = test,
+                    ntree = 100)
> rf
Call:
randomForest(formula = label ~ a + b + c + e + f, data = test, ntree = 100)
Type of random forest: classification
Number of trees: 100
No. of variables tried at each split: 2
OOB estimate of error rate: 50.17%
Confusion matrix:
A B class.error
A 2507 2493 0.4986
B 2524 2476 0.5048
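For context, a ~50% OOB error is exactly what chance predicts here. With 10000 independent random labels, any classifier's accuracy should land within roughly ±1% of 0.5; a quick sanity check in plain R (no modeling libraries needed):

```r
n  <- 10000
se <- sqrt(0.5 * 0.5 / n)           # standard error of accuracy under pure chance
band <- 0.5 + c(-1, 1) * 1.96 * se  # approximate 95% interval around 0.5

band
# The observed OOB error rate of 0.5017 falls well inside this interval
inside <- 0.5017 > band[1] && 0.5017 < band[2]
```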
However, using h2o.randomForest with the same dataset, I got a different result. Here are the code I used and the result I got:
> TEST <- as.h2o(test)
> rfh2o <- h2o.randomForest(y = "label",
x = c("a","b",
"c","d",
"e","f"),
training_frame = TEST,
ntrees = 100)
> rfh2o
Model Details:
==============
H2OBinomialModel: drf
Model ID: DRF_model_R_1501015614001_1029
Model Summary:
number_of_trees number_of_internal_trees model_size_in_bytes min_depth max_depth mean_depth min_leaves
1 100 100 366582 7 14 11.33000 1
max_leaves mean_leaves
1 319 286.52000
H2OBinomialMetrics: drf
** Reported on training data. **
** Metrics reported on Out-Of-Bag training samples **
MSE: 0.2574374
RMSE: 0.5073829
LogLoss: 0.7086906
Mean Per-Class Error: 0.5
AUC: 0.4943865
Gini: -0.01122696
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
A B Error Rate
A 0 5000 1.000000 =5000/5000
B 0 5000 0.000000 =0/5000
Totals 0 10000 0.500000 =5000/10000
Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
1 max f1 0.231771 0.666667 399
2 max f2 0.231771 0.833333 399
3 max f0point5 0.231771 0.555556 399
4 max accuracy 0.459704 0.506800 251
5 max precision 0.723654 0.593750 10
6 max recall 0.231771 1.000000 399
7 max specificity 0.785389 0.999800 0
8 max absolute_mcc 0.288276 0.051057 389
9 max min_per_class_accuracy 0.500860 0.488000 200
10 max mean_per_class_accuracy 0.459704 0.506800 251
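The degenerate confusion matrix follows directly from the F1-optimal threshold. At the low threshold (0.231771), every row is predicted as the positive class; on balanced classes that gives precision 0.5 and recall 1.0, which reproduces the reported max F1 of 0.666667 (a quick check in plain R, assuming "B" is treated as the positive class):

```r
precision <- 5000 / 10000  # everything predicted positive; half are actually positive
recall    <- 5000 / 5000   # all true positives are captured
f1 <- 2 * precision * recall / (precision + recall)
f1  # 0.6666667, matching "max f1 ... 0.666667" in the table above
```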
Based on the result above, the confusion matrix is different from what I got with the "randomForest" package.
Moreover, if I use "nlabel" instead of "label" with h2o.randomForest, I still get a very high error rate when predicting A. But "A" in the current model corresponds to "B" in the previous model. Here are the code and the result I got:
> rfh2o_n <- h2o.randomForest(y = "nlabel",
+ x = c("a","b",
+ "c","d",
+ "e","f"),
+ training_frame = TEST,
+ ntrees = 100)
> rfh2o_n
Model Details:
==============
H2OBinomialModel: drf
Model ID: DRF_model_R_1501015614001_1113
Model Summary:
number_of_trees number_of_internal_trees model_size_in_bytes min_depth max_depth mean_depth min_leaves
1 100 100 365232 11 14 11.18000 1
max_leaves mean_leaves
1 319 285.42000
H2OBinomialMetrics: drf
** Reported on training data. **
** Metrics reported on Out-Of-Bag training samples **
MSE: 0.2575674
RMSE: 0.507511
LogLoss: 0.7089465
Mean Per-Class Error: 0.5
AUC: 0.4923496
Gini: -0.01530088
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
A B Error Rate
A 0 5000 1.000000 =5000/5000
B 0 5000 0.000000 =0/5000
Totals 0 10000 0.500000 =5000/10000
Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
1 max f1 0.214495 0.666667 399
2 max f2 0.214495 0.833333 399
3 max f0point5 0.214495 0.555556 399
4 max accuracy 0.617230 0.506600 74
5 max precision 0.621806 0.541833 70
6 max recall 0.214495 1.000000 399
7 max specificity 0.749866 0.999800 0
8 max absolute_mcc 0.733630 0.042465 6
9 max min_per_class_accuracy 0.499186 0.486400 201
10 max mean_per_class_accuracy 0.617230 0.506600 74
These results make me wonder whether the labels play any role at all in h2o.randomForest. I don't use h2o often, but the results above confuse me. Is this just due to chance, did I make some silly mistake, or is it something else?
Answer 0 (score: 0)
I think this is because, since the data is completely random, the max-F1 statistic that H2O uses by default does not produce a useful threshold.
If you force the threshold to 0.5, as shown below, you get the expected behavior.
Also, if you open H2O Flow and look at the trained model's ROC curve, it is terrible and nearly a straight line (as you would expect).
library(data.table)
library(h2o)
a <- sample(0:1,10000,replace=T)
b <- sample(0:1,10000,replace=T)
c <- sample(1:10,10000,replace=T)
d <- sample(0:1,10000,replace=T)
e <- sample(0:1,10000,replace=T)
f <- sample(0:1,10000,replace=T)
df = data.frame(a, b, c, d, e, f)
dt = as.data.table(df)
dt[1:5000, label := "A"]
dt[5001:10000, label := "B"]
dt$label = as.factor(dt$label)
dt
h2o.init()
h2o_dt <- as.h2o(dt)
model = h2o.randomForest(y = "label",
x = c("a", "b", "c", "d", "e", "f"),
training_frame = h2o_dt,
ntrees = 10,
model_id = "model")
model
h2o_preds = h2o.predict(model, h2o_dt)
preds = as.data.table(h2o_preds)
preds[, prediction := A > 0.5]
table(preds$prediction)
The final output is:
FALSE TRUE
5085 4915
You can re-run this multiple times and see the values bounce around randomly, but with roughly 5000 in each group.
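That spread around 5000 is just binomial noise: with 10000 fair coin flips, the count of either outcome has standard deviation sqrt(10000 * 0.25) = 50, so re-runs should stay within a couple hundred of 5000. A quick illustration without h2o (the seed is mine, not from the original answer):

```r
set.seed(42)  # chosen for reproducibility; not in the original answer
flips <- sample(c(TRUE, FALSE), 10000, replace = TRUE)
counts <- table(flips)
counts  # both counts land close to 5000

sd_count <- sqrt(10000 * 0.5 * 0.5)  # standard deviation of each count = 50
```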