Why does the average loss go up when training with Vowpal Wabbit?

Date: 2015-08-11 05:43:45

Tags: vowpalwabbit

I am trying to train a regression model with VW on a small set of examples (about 3112). I think I am doing it correctly, yet it gives me strange results. I dug around a bit but did not find anything helpful.
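
For context, the contents of sh600000.feat are not shown here; a vw regression input file generally has one example per line in the shape below, where the feature names are made up for illustration (78 named features plus vw's implicit constant would account for the 79 "current features" shown in the progress table):

-0.2000 |f f1:0.0500 f2:-1.3100 ... f78:0.8800
0.2000 |f f1:0.1200 f2:0.0700 ... f78:-0.4300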

$ cat sh600000.feat | vw --l1 1e-8 --l2 1e-8 --readable_model model -b 24 --passes 10 --cache_file cache
using l1 regularization = 1e-08
using l2 regularization = 1e-08
Num weight bits = 24
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
using cache_file = cache
ignoring text input in favor of cache input
num sources = 1
average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features
0.040000   0.040000            1         1.0  -0.2000   0.0000       79
0.051155   0.062310            2         2.0   0.2000  -0.0496       79
0.046606   0.042056            4         4.0   0.4100   0.1482       79
0.052160   0.057715            8         8.0   0.0200   0.0021       78
0.064936   0.077711           16        16.0  -0.1800   0.0547       77
0.060507   0.056079           32        32.0   0.0000   0.3164       79
0.136933   0.213358           64        64.0  -0.5900  -0.0850       79
0.151692   0.166452          128       128.0   0.0700   0.0060       79
0.133965   0.116238          256       256.0   0.0900  -0.0446       78
0.179995   0.226024          512       512.0   0.3700  -0.0217       79
0.109296   0.038597         1024      1024.0   0.1200  -0.0728       79
0.579360   1.049425         2048      2048.0  -0.3700  -0.0084       79
0.485389   0.485389         4096      4096.0   1.9600   0.3934       79 h
0.517748   0.550036         8192      8192.0   0.0700   0.0334       79 h

finished run
number of examples per pass = 2847
passes used = 5
weighted example sum = 14236
weighted label sum = -155.98
average loss = 0.490685 h
best constant = -0.0109567
total feature number = 1121506


$ wc model
      41      48     657 model
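
The readable model begins with a few header lines; the stored weights follow, one per line, in the form index:value. A rough count of the weights kept in the model (a sketch, assuming no header line starts with digits immediately followed by a colon):

grep -c '^[0-9][0-9]*:' model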

Questions:

  1. Why is the number of features in the output (readable) model smaller than the actual number of features? I counted 78 features in the training data (plus the bias term, giving the 79 shown during training). The number of feature bits is 24, which should be more than enough to avoid collisions.

  2. Why does the average loss actually go up during training, as in the example above?

  3. (Minor) I tried increasing the number of feature bits to 32, and it then output an empty model. Why?

  4. EDIT:

    I tried altering (shuffling) the input file, and also using --holdout_off, as suggested. But the results are still almost the same - the average loss goes up.
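
    For reference, a sketch of how the shuffled copy and a fresh cache could be produced (shuf is from GNU coreutils; the file names match the run below):

    shuf sh600000.feat > sh600000.feat.shuf   # random permutation of the examples
    rm -f cache                               # or pass -k to vw so it rebuilds the cache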

    $ cat sh600000.feat.shuf | vw --l1 1e-8 --l2 1e-8 --readable_model model -b 24 --passes 10 --cache_file cache --holdout_off
    using l1 regularization = 1e-08
    using l2 regularization = 1e-08
    Num weight bits = 24
    learning rate = 0.5
    initial_t = 0
    power_t = 0.5
    decay_learning_rate = 1
    using cache_file = cache
    ignoring text input in favor of cache input
    num sources = 1
    average    since         example     example  current  current  current
    loss       last          counter      weight    label  predict features
    0.040000   0.040000            1         1.0  -0.2000   0.0000       79
    0.051155   0.062310            2         2.0   0.2000  -0.0496       79
    0.046606   0.042056            4         4.0   0.4100   0.1482       79
    0.052160   0.057715            8         8.0   0.0200   0.0021       78
    0.071332   0.090504           16        16.0   0.0300   0.1203       79
    0.043720   0.016108           32        32.0  -0.2200  -0.1971       78
    0.142895   0.242071           64        64.0   0.0100  -0.1531       79
    0.158564   0.174232          128       128.0   0.0500  -0.0439       79
    0.150691   0.142818          256       256.0   0.3200   0.1466       79
    0.197050   0.243408          512       512.0   0.2300  -0.0459       79
    0.117398   0.037747         1024      1024.0   0.0400   0.0284       79
    0.636949   1.156501         2048      2048.0   1.2500  -0.0152       79
    0.363364   0.089779         4096      4096.0   0.1800   0.0071       79
    0.477569   0.591774         8192      8192.0  -0.4800   0.0065       79
    0.411068   0.344567        16384     16384.0   0.0700   0.0450       77
    
    finished run
    number of examples per pass = 3112
    passes used = 10
    weighted example sum = 31120
    weighted label sum = -105.5
    average loss = 0.423404
    best constant = -0.0033901
    total feature number = 2451800
    

    The training examples are all distinct from one another, so I doubt there is an overfitting problem (which, as I understand it, usually occurs when the number of examples is small compared to the number of features; here there are 3112 examples against 78 features, roughly 40 examples per feature).

    EDIT2:

    I tried printing the average loss once per pass (via -P 3112, i.e. one progress line per 3112 examples) and saw that it mostly stays constant.

    $ cat dist/sh600000.feat | vw --l1 1e-8 --l2 1e-8 -f dist/model -P 3112 --passes 10 -b 24 --cache_file dist/cache
    using l1 regularization = 1e-08
    using l2 regularization = 1e-08
    Num weight bits = 24
    learning rate = 0.5
    initial_t = 0
    power_t = 0.5
    decay_learning_rate = 1
    final_regressor = dist/model
    using cache_file = dist/cache
    ignoring text input in favor of cache input
    num sources = 1
    average    since         example     example  current  current  current
    loss       last          counter      weight    label  predict features
    0.498822   0.498822         3112      3112.0   0.0800   0.0015       79 h
    0.476677   0.454595         6224      6224.0  -0.2200  -0.0085       79 h
    0.466413   0.445856         9336      9336.0   0.0200  -0.0022       79 h
    0.490221   0.561506        12448     12448.0   0.0700  -0.1113       79 h
    
    finished run
    number of examples per pass = 2847
    passes used = 5
    weighted example sum = 14236
    weighted label sum = -155.98
    average loss = 0.490685 h
    best constant = -0.0109567
    total feature number = 1121506
    

    Also tried without the --l1, --l2 and -b arguments:

    $ cat dist/sh600000.feat | vw -f dist/model -P 3112 --passes 10 --cache_file dist/cache
    Num weight bits = 18
    learning rate = 0.5
    initial_t = 0
    power_t = 0.5
    decay_learning_rate = 1
    final_regressor = dist/model
    using cache_file = dist/cache
    ignoring text input in favor of cache input
    num sources = 1
    average    since         example     example  current  current  current
    loss       last          counter      weight    label  predict features
    0.520286   0.520286         3112      3112.0   0.0800  -0.0021       79 h
    0.488581   0.456967         6224      6224.0  -0.2200  -0.0137       79 h
    0.474247   0.445538         9336      9336.0   0.0200  -0.0299       79 h
    0.496580   0.563450        12448     12448.0   0.0700  -0.1727       79 h
    0.533413   0.680958        15560     15560.0  -0.1700   0.0322       79 h
    0.524531   0.480201        18672     18672.0  -0.9800  -0.0573       79 h
    
    finished run
    number of examples per pass = 2801
    passes used = 7
    weighted example sum = 19608
    weighted label sum = -212.58
    average loss = 0.491739 h
    best constant = -0.0108415
    total feature number = 1544713
    

    Does this mean that the average loss going up within a single pass is normal, and that it is fine as long as multiple passes end up with roughly the same loss?

1 Answer:

Answer 0 (score: 2)

  1. The model file stores only non-zero weights. So most likely the others got zeroed out by the --l1 regularization.
  2. This may be caused by many reasons. Perhaps your dataset is not shuffled well enough. If you sort your dataset so that the examples labeled -1 sit in the first half and the examples labeled 1 sit in the second half, the model will show very good convergence on the first half, but when it reaches the second half the average loss will go up. So it may be an imbalance in your dataset. As for the last two losses - these are holdout losses (marked with 'h' at the end of the line) and may indicate that the model is overfitted. Please refer to my other answer.
  3. Well, in the master branch the use of -b 32 is even blocked. You should use up to -b 31. In practice -b 24-28 is usually enough, even for tens of thousands of features.

I would recommend getting the latest VW version from github.
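
A sketch pulling this advice together - every flag below is a standard vw option, and the file names come from the question:

shuf sh600000.feat > sh600000.feat.shuf   # shuffle first, in case the labels are ordered
vw -d sh600000.feat.shuf -k --cache_file cache --passes 10 -b 28 --readable_model model
# -k forces the cache to be rebuilt from the shuffled text; -b stays at or below 31;
# holdout is left on so the 'h'-marked losses can still flag overfitting across passes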