I'm attempting one-against-all logistic regression with Vowpal Wabbit to classify editorial articles into topic categories based on their text. When I predict on the same data I used for training, my results are poor, but I would expect unrealistically good results due to overfitting. In this case I actually want to overfit, because I want to verify that I'm using Vowpal Wabbit correctly.
My model is trained on examples like the following, where each feature is a word from the article and each label is the identifier of a category, such as sports or entertainment:
1 | the baseball player ... stadium
4 | musicians played all ... crowd
...
2 | fish are an ... squid
My training command looks like this:
vw --oaa=19 --loss_function=logistic --save_resume -d /tmp/train.vw -f /tmp/model.vw
My testing command looks like this:
vw -t --probabilities --loss_function=logistic --link=logistic -d /tmp/test.vw -i /tmp/model.vw -p /tmp/predict.vw --raw_predictions=/tmp/predictions_raw.vw
I'm using --probabilities and --link=logistic because I want the results to be interpretable as the probability that an article belongs to a given class.
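For reference, with --oaa=19 and --probabilities my understanding is that each line of /tmp/predict.vw should hold one label:probability pair per class; the values below are made up purely for illustration:
1:0.052 2:0.013 3:0.201 ... 19:0.008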
There is an obvious problem with the size of my dataset (81 examples and 52,000 features), but I expected that to cause severe overfitting, so any predictions made on the same data used for training should be very good. Is something wrong with my Vowpal Wabbit commands, or is there something about the data science here that I'm misunderstanding?
Here is the output of the training command:
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = /tmp/train.vw
num sources = 1
average since example example current current current
loss last counter weight label predict features
1.000000 1.000000 1 1.0 15 1 451
1.000000 1.000000 2 2.0 8 15 296
1.000000 1.000000 4 4.0 8 7 333
0.875000 0.750000 8 8.0 15 15 429
0.500000 0.125000 16 16.0 8 7 305
0.531250 0.562500 32 32.0 12 8 117
0.500000 0.468750 64 64.0 3 15 117
finished run
number of examples per pass = 81
passes used = 1
weighted example sum = 81.000000
weighted label sum = 0.000000
average loss = 0.518519
total feature number = 52703
And of the testing command:
only testing
predictions = /tmp/predict.vw
raw predictions = /tmp/predictions_raw.vw
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = /tmp/test.vw
num sources = 1
average since example example current current current
loss last counter weight label predict features
1.000000 -0.015873 1 1.0 4294967295 3( 7%) 117
1.000000 1.000000 2 2.0 4294967295 3( 7%) 88
1.000000 1.000000 4 4.0 4294967295 3( 7%) 188
1.000000 1.000000 8 8.0 4294967295 9( 7%) 1175
1.000000 1.000000 16 16.0 4294967295 5( 7%) 883
1.000000 1.000000 32 32.0 4294967295 7( 7%) 229
1.000000 1.000000 64 64.0 4294967295 15( 7%) 304
finished run
number of examples per pass = 40
passes used = 2
weighted example sum = 81.000000
weighted label sum = 0.000000
average loss = 1.000000
average multiclass log loss = 999.000000
total feature number = 52703
Answer 0 (score: 0):
I believe my main problem was just that I needed to run more passes. I don't quite understand how vw implements online learning and how this differs from batch learning, but after running multiple passes the average loss dropped to 13%. With --holdout_off enabled, this loss dropped further to 1%. Many thanks to @arielf and @MartinPopel.
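For reference, I believe the variant with the holdout evaluation disabled looks roughly like this (a sketch reusing the same placeholder paths as above; only --holdout_off is added to the command that follows):
vw --oaa=19 --loss_function=logistic --save_resume -c --passes 10 --holdout_off -d /tmp/train.vw -f /tmp/model.vw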
Running the training command with 2421 examples:
vw --oaa=19 --loss_function=logistic --save_resume -c --passes 10 -d /tmp/train.vw -f /tmp/model.vw
final_regressor = /tmp/model.vw
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
using cache_file = /tmp/train.vw.cache
ignoring text input in favor of cache input
num sources = 1
average since example example current current current
loss last counter weight label predict features
1.000000 1.000000 1 1.0 11 1 234
1.000000 1.000000 2 2.0 6 11 651
1.000000 1.000000 4 4.0 2 12 1157
1.000000 1.000000 8 8.0 4 2 74
1.000000 1.000000 16 16.0 12 15 171
0.906250 0.812500 32 32.0 9 6 6
0.750000 0.593750 64 64.0 15 19 348
0.625000 0.500000 128 128.0 12 12 110
0.566406 0.507812 256 256.0 12 5 176
0.472656 0.378906 512 512.0 5 5 168
0.362305 0.251953 1024 1024.0 16 8 274
0.293457 0.224609 2048 2048.0 3 4 118
0.224670 0.224670 4096 4096.0 8 8 850 h
0.191419 0.158242 8192 8192.0 6 6 249 h
0.164926 0.138462 16384 16384.0 3 4 154 h
finished run
number of examples per pass = 2179
passes used = 10
weighted example sum = 21790.000000
weighted label sum = 0.000000
average loss = 0.132231 h
total feature number = 12925010
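To sanity-check the fitted model, the testing command from the question can then be pointed at the new model file; this is just the earlier command repeated for completeness, with the same placeholder paths:
vw -t --probabilities --loss_function=logistic --link=logistic -d /tmp/test.vw -i /tmp/model.vw -p /tmp/predict.vw --raw_predictions=/tmp/predictions_raw.vw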