Do labels work in h2o.randomForest?

Asked: 2017-07-26 14:59:57

Tags: r classification random-forest h2o

I am using h2o.randomForest in R to build a classifier over two groups, group "A" and group "B". As an example, I randomly generated a sample dataset as shown below and converted it to an H2OFrame:

    a <- sample(0:1,10000,replace=T)
    b <- sample(0:1,10000,replace=T)
    c <- sample(1:10,10000,replace=T)
    d <- sample(0:1,10000,replace=T)
    e <- sample(0:1,10000,replace=T)
    f <- sample(0:1,10000,replace=T)

Basically, these will be converted to factors, all with 2 levels except c, which has 10 levels. The first 5000 rows are assigned the label "A" and the rest the label "B". In addition, I have another column named nlabel, where the first 5000 rows are "B" and the rest are "A".

Here are the first 10 and last 10 rows of my dataset:

          a b  c d e f label nlabel
    1     0 0  5 0 1 0     A      B
    2     0 1  5 1 1 1     A      B
    3     0 0  6 0 0 1     A      B
    4     0 0  8 0 0 1     A      B
    5     1 1  1 1 1 1     A      B
    6     1 1  6 1 0 1     A      B
    7     1 0  3 1 1 1     A      B
    8     1 1  9 1 0 1     A      B
    9     1 0  8 1 0 1     A      B
    10    0 0  1 0 1 1     A      B
    .............
    9991  1 1  3 0 0 1     B      A
    9992  0 0  7 1 0 0     B      A
    9993  1 0  9 0 1 1     B      A
    9994  0 1  3 0 0 0     B      A
    9995  1 1  8 0 1 0     B      A
    9996  0 1  8 0 1 0     B      A
    9997  1 1  9 0 1 0     B      A
    9998  0 0  5 1 0 1     B      A
    9999  0 1  9 1 1 0     B      A
    10000 0 1 10 1 0 1     B      A
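For completeness, the rows above can be reproduced by assembling the sampled vectors into a data frame. This is a reconstruction, since the original assembly code wasn't posted, and `set.seed` is added here only so the sketch is reproducible:

```r
# Reconstruction of the dataset (the original assembly code wasn't
# posted); set.seed is added here so the sketch is reproducible.
set.seed(42)
a <- sample(0:1, 10000, replace = TRUE)
b <- sample(0:1, 10000, replace = TRUE)
c <- sample(1:10, 10000, replace = TRUE)
d <- sample(0:1, 10000, replace = TRUE)
e <- sample(0:1, 10000, replace = TRUE)
f <- sample(0:1, 10000, replace = TRUE)

# Predictors become factors (2 levels each, 10 for c); the two label
# columns split the rows 5000/5000 in opposite directions.
test <- data.frame(a = factor(a), b = factor(b), c = factor(c),
                   d = factor(d), e = factor(e), f = factor(f))
test$label  <- factor(rep(c("A", "B"), each = 5000))
test$nlabel <- factor(rep(c("B", "A"), each = 5000))
str(test)
```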

Since I generated the dataset randomly, I don't expect to be able to build a good classifier (or else I'd be the luckiest person in the world); I expect nothing better than random guessing. Here is what I got using the "randomForest" package in R:

    > rf <- randomForest(label ~ a + b + c + e + f, 
    +                            data = test, 
    +                            ntree = 100)
    > rf

        Call:
         randomForest(formula = label ~ a + b + c + e + f, data = test,      ntree = 100) 
                       Type of random forest: classification
                             Number of trees: 100
        No. of variables tried at each split: 2

                OOB estimate of  error rate: 50.17%
        Confusion matrix:
             A    B class.error
        A 2507 2493      0.4986
        B 2524 2476      0.5048
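The 50.17% OOB error rate above can be checked directly against the confusion matrix it is reported with:

```r
# Confusion matrix values copied from the randomForest output above.
cm <- matrix(c(2507, 2493,
               2524, 2476), nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("A", "B"), predicted = c("A", "B")))
err <- 1 - sum(diag(cm)) / sum(cm)
err  # 0.5017, i.e. the 50.17% OOB error rate reported above
```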

However, using h2o.randomForest with the same dataset, I got a different result. Here are the code I used and the result I got:

    > TEST <- as.h2o(test)
    > rfh2o <- h2o.randomForest(y = "label",
    +                           x = c("a","b",
    +                                 "c","d",
    +                                 "e","f"),
    +                           training_frame = TEST,
    +                           ntrees = 100)
    > rfh2o
    Model Details:
    ==============

    H2OBinomialModel: drf
    Model ID:  DRF_model_R_1501015614001_1029 
    Model Summary: 
      number_of_trees number_of_internal_trees model_size_in_bytes min_depth max_depth mean_depth min_leaves
    1             100                      100              366582         7        14   11.33000          1
      max_leaves mean_leaves
    1        319   286.52000


    H2OBinomialMetrics: drf
    ** Reported on training data. **
    ** Metrics reported on Out-Of-Bag training samples **

    MSE:  0.2574374
    RMSE:  0.5073829
    LogLoss:  0.7086906
    Mean Per-Class Error:  0.5
    AUC:  0.4943865
    Gini:  -0.01122696

    Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
           A     B    Error         Rate
    A      0  5000 1.000000   =5000/5000
    B      0  5000 0.000000      =0/5000
    Totals 0 10000 0.500000  =5000/10000

    Maximum Metrics: Maximum metrics at their respective thresholds
                            metric threshold    value idx
    1                       max f1  0.231771 0.666667 399
    2                       max f2  0.231771 0.833333 399
    3                 max f0point5  0.231771 0.555556 399
    4                 max accuracy  0.459704 0.506800 251
    5                max precision  0.723654 0.593750  10
    6                   max recall  0.231771 1.000000 399
    7              max specificity  0.785389 0.999800   0
    8             max absolute_mcc  0.288276 0.051057 389
    9   max min_per_class_accuracy  0.500860 0.488000 200
    10 max mean_per_class_accuracy  0.459704 0.506800 251

Based on the result above, the confusion matrix is different from the one I got from the "randomForest" package.
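The degenerate all-"B" confusion matrix can be reproduced with a small base-R illustration (here `prob_B` is a made-up stand-in for the model's predicted P(B), not actual h2o output): a low threshold like the F1-optimal threshold of about 0.23 reported above pushes every row into "B", while a 0.5 threshold gives a roughly balanced matrix.

```r
set.seed(1)
actual <- rep(c("A", "B"), each = 5000)
prob_B <- runif(10000, 0.25, 0.75)            # random-guess probabilities
pred_low  <- ifelse(prob_B > 0.23, "B", "A")  # every row exceeds 0.23
pred_half <- ifelse(prob_B > 0.5,  "B", "A")  # roughly half and half
table(actual, pred_low)   # all 10000 predictions are "B"
table(actual, pred_half)  # close to 5000 per predicted class
```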

Moreover, if I use "nlabel" instead of "label" with h2o.randomForest, I still get a high error rate when predicting A, even though "A" in this model plays the role of "B" in the previous one. Here are the code and the result I got:

    > rfh2o_n <- h2o.randomForest(y = "nlabel",
    +                           x = c("a","b",
    +                                 "c","d",
    +                                 "e","f"),
    +                           training_frame = TEST,
    +                           ntrees = 100)

    > rfh2o_n
    Model Details:
    ==============

    H2OBinomialModel: drf
    Model ID:  DRF_model_R_1501015614001_1113 
    Model Summary: 
      number_of_trees number_of_internal_trees model_size_in_bytes min_depth max_depth mean_depth min_leaves
    1             100                      100              365232        11        14   11.18000          1
      max_leaves mean_leaves
    1        319   285.42000


    H2OBinomialMetrics: drf
    ** Reported on training data. **
    ** Metrics reported on Out-Of-Bag training samples **

    MSE:  0.2575674
    RMSE:  0.507511
    LogLoss:  0.7089465
    Mean Per-Class Error:  0.5
    AUC:  0.4923496
    Gini:  -0.01530088

    Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
           A     B    Error         Rate
    A      0  5000 1.000000   =5000/5000
    B      0  5000 0.000000      =0/5000
    Totals 0 10000 0.500000  =5000/10000

    Maximum Metrics: Maximum metrics at their respective thresholds
                            metric threshold    value idx
    1                       max f1  0.214495 0.666667 399
    2                       max f2  0.214495 0.833333 399
    3                 max f0point5  0.214495 0.555556 399
    4                 max accuracy  0.617230 0.506600  74
    5                max precision  0.621806 0.541833  70
    6                   max recall  0.214495 1.000000 399
    7              max specificity  0.749866 0.999800   0
    8             max absolute_mcc  0.733630 0.042465   6
    9   max min_per_class_accuracy  0.499186 0.486400 201
    10 max mean_per_class_accuracy  0.617230 0.506600  74

Results like these make me wonder whether labels matter at all in h2o.randomForest. I don't use h2o often, but the output above confuses me. Is this just due to chance, did I make some silly mistake, or is it something else?

1 Answer:

Answer 0 (score: 0)

I believe this is because, since the data is completely random, the max-F1 statistic that H2O uses by default to pick a threshold doesn't produce a useful value.

If you force the threshold to 0.5, as shown below, you get the expected behavior.

Also, if you open H2O Flow and look at the trained model's ROC curve, it is terrible and almost a straight line (as you would expect).

    library(data.table)
    library(h2o)

    a <- sample(0:1,10000,replace=T)
    b <- sample(0:1,10000,replace=T)
    c <- sample(1:10,10000,replace=T)
    d <- sample(0:1,10000,replace=T)
    e <- sample(0:1,10000,replace=T)
    f <- sample(0:1,10000,replace=T)
    df = data.frame(a, b, c, d, e, f)
    dt = as.data.table(df)
    dt[1:5000, label := "A"]
    dt[5001:10000, label := "B"]
    dt$label = as.factor(dt$label)
    dt

    h2o.init()
    h2o_dt <- as.h2o(dt)
    model = h2o.randomForest(y = "label",
                             x = c("a", "b", "c", "d", "e", "f"),
                             training_frame = h2o_dt,
                             ntrees = 10,
                             model_id = "model")
    model
    h2o_preds = h2o.predict(model, h2o_dt)
    preds = as.data.table(h2o_preds)
    preds[, prediction := A > 0.5]
    table(preds$prediction)

The final output is:

    FALSE  TRUE 
     5085  4915 

You can rerun it multiple times and watch the counts bounce around randomly, but each group stays at roughly 5000.
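That "bounces around 5000" behaviour is ordinary binomial sampling noise, which a quick base-R sketch (independent of H2O) illustrates:

```r
# With effectively random predictions, each of the 10000 rows is a fair
# coin flip, so per-class counts follow Binomial(10000, 0.5) with a
# standard deviation of about 50.
set.seed(7)  # seed chosen arbitrarily for the illustration
counts <- replicate(5, sum(runif(10000) > 0.5))
counts  # five counts, each close to 5000
```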