I'm attempting one-against-all logistic regression with Vowpal Wabbit to classify editorial articles into topic categories based on their text. When I predict on the same data I used for training, my results are poor, but I would expect unrealistically good results due to overfitting. In this case I actually want to overfit, because I want to verify that I'm using Vowpal Wabbit correctly.
My model is trained on examples like the following, where each feature is a word from the article and each label is the identifier of a category, such as sports or entertainment:
1 | the baseball player ... stadium
4 | musicians played all ... crowd
...
2 | fish are an ... squid
My training command looks like this:
vw --oaa=19 --loss_function=logistic --save_resume -d /tmp/train.vw -f /tmp/model.vw
My testing command looks like this:
vw -t --probabilities --loss_function=logistic --link=logistic -d /tmp/test.vw -i /tmp/model.vw -p /tmp/predict.vw --raw_predictions=/tmp/predictions_raw.vw
I'm using --probabilities and --link=logistic because I want the results to be interpretable as the probability that an article belongs to a given class.
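For reference, with --oaa=19 and --probabilities my understanding is that each line of /tmp/predict.vw should hold one label:probability pair per class; the values below are made up purely for illustration:
1:0.052 2:0.013 3:0.201 ... 19:0.008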
There is an obvious problem with the size of my dataset (81 examples and 52,000 features), but I expected that to cause severe overfitting, so any predictions made on the same data used for training should be very good. Is something wrong with my Vowpal Wabbit commands, or is there something about the data science here that I'm misunderstanding?
Here is the output of the training command:
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = /tmp/train.vw
num sources = 1
average since example example current current current
loss last counter weight label predict features
1.000000 1.000000 1 1.0 15 1 451
1.000000 1.000000 2 2.0 8 15 296
1.000000 1.000000 4 4.0 8 7 333
0.875000 0.750000 8 8.0 15 15 429
0.500000 0.125000 16 16.0 8 7 305
0.531250 0.562500 32 32.0 12 8 117
0.500000 0.468750 64 64.0 3 15 117
finished run
number of examples per pass = 81
passes used = 1
weighted example sum = 81.000000
weighted label sum = 0.000000
average loss = 0.518519
total feature number = 52703
And of the testing command:
only testing
predictions = /tmp/predict.vw
raw predictions = /tmp/predictions_raw.vw
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = /tmp/test.vw
num sources = 1
average since example example current current current
loss last counter weight label predict features
1.000000 -0.015873 1 1.0 4294967295 3( 7%) 117
1.000000 1.000000 2 2.0 4294967295 3( 7%) 88
1.000000 1.000000 4 4.0 4294967295 3( 7%) 188
1.000000 1.000000 8 8.0 4294967295 9( 7%) 1175
1.000000 1.000000 16 16.0 4294967295 5( 7%) 883
1.000000 1.000000 32 32.0 4294967295 7( 7%) 229
1.000000 1.000000 64 64.0 4294967295 15( 7%) 304
finished run
number of examples per pass = 40
passes used = 2
weighted example sum = 81.000000
weighted label sum = 0.000000
average loss = 1.000000
average multiclass log loss = 999.000000
total feature number = 52703
Answer 0 (score: 0):
I believe my main problem was just that I needed to run more passes. I don't quite understand how vw implements online learning and how this differs from batch learning, but after running multiple passes the average loss dropped to 13%. With --holdout_off enabled, this loss dropped further to 1%. Many thanks to @arielf and @MartinPopel.
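For reference, I believe the variant with the holdout evaluation disabled looks roughly like this (a sketch reusing the same placeholder paths as above; only --holdout_off is added to the command that follows):
vw --oaa=19 --loss_function=logistic --save_resume -c --passes 10 --holdout_off -d /tmp/train.vw -f /tmp/model.vw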
Running the training command with 2421 examples:
vw --oaa=19 --loss_function=logistic --save_resume -c --passes 10 -d /tmp/train.vw -f /tmp/model.vw
final_regressor = /tmp/model.vw
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
using cache_file = /tmp/train.vw.cache
ignoring text input in favor of cache input
num sources = 1
average since example example current current current
loss last counter weight label predict features
1.000000 1.000000 1 1.0 11 1 234
1.000000 1.000000 2 2.0 6 11 651
1.000000 1.000000 4 4.0 2 12 1157
1.000000 1.000000 8 8.0 4 2 74
1.000000 1.000000 16 16.0 12 15 171
0.906250 0.812500 32 32.0 9 6 6
0.750000 0.593750 64 64.0 15 19 348
0.625000 0.500000 128 128.0 12 12 110
0.566406 0.507812 256 256.0 12 5 176
0.472656 0.378906 512 512.0 5 5 168
0.362305 0.251953 1024 1024.0 16 8 274
0.293457 0.224609 2048 2048.0 3 4 118
0.224670 0.224670 4096 4096.0 8 8 850 h
0.191419 0.158242 8192 8192.0 6 6 249 h
0.164926 0.138462 16384 16384.0 3 4 154 h
finished run
number of examples per pass = 2179
passes used = 10
weighted example sum = 21790.000000
weighted label sum = 0.000000
average loss = 0.132231 h
total feature number = 12925010
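To sanity-check the fitted model, the testing command from the question can then be pointed at the new model file; this is just the earlier command repeated for completeness, with the same placeholder paths:
vw -t --probabilities --loss_function=logistic --link=logistic -d /tmp/test.vw -i /tmp/model.vw -p /tmp/predict.vw --raw_predictions=/tmp/predictions_raw.vw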