Why does the average loss go up when training with Vowpal Wabbit?

Date: 2015-08-11 05:43:45

Tags: vowpalwabbit

I am trying to train a regression model with VW on a small set of examples (about 3112). I think I am doing it correctly, yet it gives me strange results. I dug around a bit but did not find anything helpful.
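
For context, the contents of sh600000.feat are not shown here; a vw regression input file generally has one example per line in the shape below, where the feature names are made up for illustration (78 named features plus vw's implicit constant would account for the 79 "current features" shown in the progress table):

-0.2000 |f f1:0.0500 f2:-1.3100 ... f78:0.8800
0.2000 |f f1:0.1200 f2:0.0700 ... f78:-0.4300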

$ cat sh600000.feat | vw --l1 1e-8 --l2 1e-8 --readable_model model -b 24 --passes 10 --cache_file cache
using l1 regularization = 1e-08
using l2 regularization = 1e-08
Num weight bits = 24
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
using cache_file = cache
ignoring text input in favor of cache input
num sources = 1
average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features
0.040000   0.040000            1         1.0  -0.2000   0.0000       79
0.051155   0.062310            2         2.0   0.2000  -0.0496       79
0.046606   0.042056            4         4.0   0.4100   0.1482       79
0.052160   0.057715            8         8.0   0.0200   0.0021       78
0.064936   0.077711           16        16.0  -0.1800   0.0547       77
0.060507   0.056079           32        32.0   0.0000   0.3164       79
0.136933   0.213358           64        64.0  -0.5900  -0.0850       79
0.151692   0.166452          128       128.0   0.0700   0.0060       79
0.133965   0.116238          256       256.0   0.0900  -0.0446       78
0.179995   0.226024          512       512.0   0.3700  -0.0217       79
0.109296   0.038597         1024      1024.0   0.1200  -0.0728       79
0.579360   1.049425         2048      2048.0  -0.3700  -0.0084       79
0.485389   0.485389         4096      4096.0   1.9600   0.3934       79 h
0.517748   0.550036         8192      8192.0   0.0700   0.0334       79 h

finished run
number of examples per pass = 2847
passes used = 5
weighted example sum = 14236
weighted label sum = -155.98
average loss = 0.490685 h
best constant = -0.0109567
total feature number = 1121506


$ wc model
      41      48     657 model
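
The readable model begins with a few header lines; the stored weights follow, one per line, in the form index:value. A rough count of the weights kept in the model (a sketch, assuming no header line starts with digits immediately followed by a colon):

grep -c '^[0-9][0-9]*:' model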

Questions:

  1. Why is the number of features in the output (readable) model smaller than the actual number of features? I counted 78 features in the training data (plus the bias term, giving the 79 shown during training). The number of feature bits is 24, which should be more than enough to avoid collisions.

  2. Why does the average loss actually go up during training, as in the example above?

  3. (Minor) I tried increasing the number of feature bits to 32, and it then output an empty model. Why?

  4. EDIT:

    I tried altering (shuffling) the input file, and also using --holdout_off, as suggested. But the results are still almost the same - the average loss goes up.
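
    For reference, a sketch of how the shuffled copy and a fresh cache could be produced (shuf is from GNU coreutils; the file names match the run below):

    shuf sh600000.feat > sh600000.feat.shuf   # random permutation of the examples
    rm -f cache                               # or pass -k to vw so it rebuilds the cache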

    $ cat sh600000.feat.shuf | vw --l1 1e-8 --l2 1e-8 --readable_model model -b 24 --passes 10 --cache_file cache --holdout_off
    using l1 regularization = 1e-08
    using l2 regularization = 1e-08
    Num weight bits = 24
    learning rate = 0.5
    initial_t = 0
    power_t = 0.5
    decay_learning_rate = 1
    using cache_file = cache
    ignoring text input in favor of cache input
    num sources = 1
    average    since         example     example  current  current  current
    loss       last          counter      weight    label  predict features
    0.040000   0.040000            1         1.0  -0.2000   0.0000       79
    0.051155   0.062310            2         2.0   0.2000  -0.0496       79
    0.046606   0.042056            4         4.0   0.4100   0.1482       79
    0.052160   0.057715            8         8.0   0.0200   0.0021       78
    0.071332   0.090504           16        16.0   0.0300   0.1203       79
    0.043720   0.016108           32        32.0  -0.2200  -0.1971       78
    0.142895   0.242071           64        64.0   0.0100  -0.1531       79
    0.158564   0.174232          128       128.0   0.0500  -0.0439       79
    0.150691   0.142818          256       256.0   0.3200   0.1466       79
    0.197050   0.243408          512       512.0   0.2300  -0.0459       79
    0.117398   0.037747         1024      1024.0   0.0400   0.0284       79
    0.636949   1.156501         2048      2048.0   1.2500  -0.0152       79
    0.363364   0.089779         4096      4096.0   0.1800   0.0071       79
    0.477569   0.591774         8192      8192.0  -0.4800   0.0065       79
    0.411068   0.344567        16384     16384.0   0.0700   0.0450       77
    
    finished run
    number of examples per pass = 3112
    passes used = 10
    weighted example sum = 31120
    weighted label sum = -105.5
    average loss = 0.423404
    best constant = -0.0033901
    total feature number = 2451800
    

    The training examples are all distinct from one another, so I doubt there is an overfitting problem (which, as I understand it, usually occurs when the number of examples is small compared to the number of features; here there are 3112 examples against 78 features, roughly 40 examples per feature).

    EDIT2:

    I tried printing the average loss once per pass (via -P 3112, i.e. one progress line per 3112 examples) and saw that it mostly stays constant.

    $ cat dist/sh600000.feat | vw --l1 1e-8 --l2 1e-8 -f dist/model -P 3112 --passes 10 -b 24 --cache_file dist/cache
    using l1 regularization = 1e-08
    using l2 regularization = 1e-08
    Num weight bits = 24
    learning rate = 0.5
    initial_t = 0
    power_t = 0.5
    decay_learning_rate = 1
    final_regressor = dist/model
    using cache_file = dist/cache
    ignoring text input in favor of cache input
    num sources = 1
    average    since         example     example  current  current  current
    loss       last          counter      weight    label  predict features
    0.498822   0.498822         3112      3112.0   0.0800   0.0015       79 h
    0.476677   0.454595         6224      6224.0  -0.2200  -0.0085       79 h
    0.466413   0.445856         9336      9336.0   0.0200  -0.0022       79 h
    0.490221   0.561506        12448     12448.0   0.0700  -0.1113       79 h
    
    finished run
    number of examples per pass = 2847
    passes used = 5
    weighted example sum = 14236
    weighted label sum = -155.98
    average loss = 0.490685 h
    best constant = -0.0109567
    total feature number = 1121506
    

    Also tried without the --l1, --l2 and -b arguments:

    $ cat dist/sh600000.feat | vw -f dist/model -P 3112 --passes 10 --cache_file dist/cache
    Num weight bits = 18
    learning rate = 0.5
    initial_t = 0
    power_t = 0.5
    decay_learning_rate = 1
    final_regressor = dist/model
    using cache_file = dist/cache
    ignoring text input in favor of cache input
    num sources = 1
    average    since         example     example  current  current  current
    loss       last          counter      weight    label  predict features
    0.520286   0.520286         3112      3112.0   0.0800  -0.0021       79 h
    0.488581   0.456967         6224      6224.0  -0.2200  -0.0137       79 h
    0.474247   0.445538         9336      9336.0   0.0200  -0.0299       79 h
    0.496580   0.563450        12448     12448.0   0.0700  -0.1727       79 h
    0.533413   0.680958        15560     15560.0  -0.1700   0.0322       79 h
    0.524531   0.480201        18672     18672.0  -0.9800  -0.0573       79 h
    
    finished run
    number of examples per pass = 2801
    passes used = 7
    weighted example sum = 19608
    weighted label sum = -212.58
    average loss = 0.491739 h
    best constant = -0.0108415
    total feature number = 1544713
    

    Does this mean that the average loss going up within a single pass is normal, and that it is fine as long as multiple passes end up with roughly the same loss?

1 Answer:

Answer 0 (score: 2)

  1. The model file stores only non-zero weights. So most likely the others got zeroed out by the --l1 regularization.
  2. This may be caused by many reasons. Perhaps your dataset is not shuffled well enough. If you sort your dataset so that the examples labeled -1 sit in the first half and the examples labeled 1 sit in the second half, the model will show very good convergence on the first half, but when it reaches the second half the average loss will go up. So it may be an imbalance in your dataset. As for the last two losses - these are holdout losses (marked with 'h' at the end of the line) and may indicate that the model is overfitted. Please refer to my other answer.
  3. Well, in the master branch the use of -b 32 is even blocked. You should use up to -b 31. In practice -b 24-28 is usually enough, even for tens of thousands of features.

I would recommend getting the latest VW version from github.
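
A sketch pulling this advice together - every flag below is a standard vw option, and the file names come from the question:

shuf sh600000.feat > sh600000.feat.shuf   # shuffle first, in case the labels are ordered
vw -d sh600000.feat.shuf -k --cache_file cache --passes 10 -b 28 --readable_model model
# -k forces the cache to be rebuilt from the shuffled text; -b stays at or below 31;
# holdout is left on so the 'h'-marked losses can still flag overfitting across passes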