Question

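I am confused by the way vw extracts features. Consider a text classification problem where I want to use character n-grams as features. In the simplest case that illustrates my problem, the input string is "aa" and I use only 1-gram features. So the example should consist of a single feature "a" with a count of 2, as follows: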

$ echo "1 |X a:2" | vw --noconstant --invert_hash f && grep '^X^' f
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = 
num sources = 1
average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features
1.000000   1.000000            1         1.0   1.0000   0.0000        1

finished run
number of examples per pass = 1
passes used = 1
weighted example sum = 1
weighted label sum = 1
average loss = 1
best constant = 1
total feature number = 1
X^a:108118:0.196698

However, if I pass the string "a a" to vw (introducing a space between the characters), vw reports 2 features:

$ echo "1 |X a a" | vw --noconstant --invert_hash f && grep '^X^' f
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = 
num sources = 1
average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features
1.000000   1.000000            1         1.0   1.0000   0.0000        2

finished run
number of examples per pass = 1
passes used = 1
weighted example sum = 1
weighted label sum = 1
average loss = 1
best constant = 1
total feature number = 2
X^a:108118:0.375311

The actual model contains only one feature (as I would expect), but its weight (0.375311) differs from the weight in the first model (0.196698).

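When training on a real dataset with higher-order n-grams, substantial differences in average loss can be observed depending on which input format is used. I looked at the source code in parser.cc, and given more time I could probably figure out what is going on; but if someone could explain the discrepancy between the two cases above (is it a bug?) and/or point me to the relevant part of the source, I would appreciate the help.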

Answer 1

I think the total feature number is just a counter of the features observed. For example, you will get 10 with the following command:

$ echo "1 |X a" | vw --noconstant --passes 10 --cache_file f -k

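I also see in the vw code that the regression value of a feature is divided by the feature weight (the number after the colon) before it is printed. This can be seen from the following: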

$ echo "1 |X a:1" | vw --noconstant --invert_hash f && grep '^X^' f
X^a:108118:0.393395
$ echo "1 |X a:2" | vw --noconstant --invert_hash f && grep '^X^' f
X^a:108118:0.196698
$ echo "1 |X a:3" | vw --noconstant --invert_hash f && grep '^X^' f
X^a:108118:0.131132
$ echo "1 |X a:10" | vw --noconstant --invert_hash f && grep '^X^' f
X^a:108118:0.039344
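
One way to read these four numbers is that the product of the reported weight and the feature value stays nearly constant. A small Python check, using only the weights printed above (the variable and function names are mine, for illustration):

```python
# Weights reported by --invert_hash above, keyed by the value x in "a:x".
reported = {1: 0.393395, 2: 0.196698, 3: 0.131132, 10: 0.039344}

def weight_times_value(weights):
    """Return w * x for each (x, w) pair."""
    return {x: w * x for x, w in weights.items()}

products = weight_times_value(reported)
for x, p in products.items():
    print(f"x={x:>2}  w*x={p:.6f}")
```

All four products agree with 0.393395 to within about 0.02%. With --noconstant and a single feature, w*x is what the model would predict for the example, so that prediction is the same whatever the scale of x. These numbers alone cannot tell whether that comes from a division at print time or from the update rule itself.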

I suspected that features are exclusive, e.g. "|X a" and "|X a a" should give the same result, but they do not:

$ echo "1 |X a" | vw --noconstant --invert_hash f && grep '^X^' f
X^a:108118:0.393395
$ echo "1 |X a a" | vw --noconstant --invert_hash f && grep '^X^' f
X^a:108118:0.375311
$ echo "1 |X a a" | vw --noconstant --invert_hash f && grep '^X^' f
X^a:108118:0.366083

I don't know why. There should be some logic behind this. But if you specify --sort_features, it works as I expected:

$ echo "1 |X a" | vw --noconstant --invert_hash f && grep '^X^' f
X^a:108118:0.393395
$ echo "1 |X a a a a a" | vw --noconstant --invert_hash f --sort_features && grep '^X^' f
X^a:108118:0.393395

Interestingly, if --sort_features is specified, vw uses only the first occurrence of a feature. For example:

$ echo "1 |X a a:10" | vw --noconstant --invert_hash f --sort_features && grep '^X^' f
X^a:108118:0.393395
$ echo "1 |X a a:2" | vw --noconstant --invert_hash f --sort_features && grep '^X^' f
X^a:108118:0.393395
$ echo "1 |X a:10 a" | vw --noconstant --invert_hash f --sort_features && grep '^X^' f
X^a:108118:0.039344
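
These three results are consistent with a simple rule: when a feature name repeats, only the value attached to its first occurrence counts. The Python sketch below models that rule. It is only a model of what the outputs above show, not a description of vw's code, and `keep_first_occurrence` is a hypothetical helper.

```python
def keep_first_occurrence(tokens):
    """Model of the observation: for a repeated feature name, keep the first value."""
    kept = {}
    for tok in tokens.split():
        name, _, value = tok.partition(":")
        # A bare name such as "a" carries the implicit value 1.
        kept.setdefault(name, float(value) if value else 1.0)
    return kept

for features in ["a a:10", "a a:2", "a:10 a"]:
    print(features, "->", keep_first_occurrence(features))
```

Under this rule the first two inputs reduce to a:1, which matches the weight 0.393395 obtained for "|X a", and the third reduces to a:10, which matches the weight 0.039344 obtained for "|X a:10".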

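I hope these observations let you use vw the way you need. But I am not sure whether this is a bug or a feature. I will forward this to the vw authors for comment.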
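
If the goal is the count-valued representation from the question, one option is to aggregate repeated n-grams before they reach vw, so that the input does not depend on how vw treats duplicate features. A minimal Python sketch follows. The function names are hypothetical, and it assumes the text contains no spaces, colons, or pipe characters, which have special meaning in vw's input format and would need escaping.

```python
from collections import Counter

def char_ngrams(text, n):
    """Return all contiguous character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def to_vw_line(label, text, n=1, namespace="X"):
    """Format one example as 'label |namespace gram:count ...'.

    Assumption: no n-gram contains ' ', ':' or '|'; this sketch does not escape them.
    """
    counts = Counter(char_ngrams(text, n))
    features = " ".join(f"{gram}:{count}" for gram, count in sorted(counts.items()))
    return f"{label} |{namespace} {features}"

print(to_vw_line(1, "aa"))
```

For the string "aa" this prints the line used in the first command of the question, `1 |X a:2`, which can then be piped to vw as before.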

Vowpal Wabbit feature extraction
