Question

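I am trying to use vw to find words or phrases that predict whether someone will open an email. The target is 1 if the email was opened and 0 otherwise. My data looks like this: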

1 |A this is a test
0 |A this test is only temporary
1 |A i bought a new polo shirt
1 |A that was a great online sale

I put this in a file named 'test1.txt' and ran the following command to generate 2-grams (bigrams) and output the variable information:

C:\~\vw>perl vw-varinfo.pl -V --ngram 2 test1.txt >> out.txt

When I look at the output, the bigrams I see are not the ones in the original data. Is this a bug, or am I misunderstanding something?

Output:

FeatureName            HashVal   MinVal   MaxVal    Weight   RelScore
A^a                     239656     0.00     1.00   +0.1664    100.00%
A^is                      7514     0.00     1.00   +0.0772     46.38%
A^test                   12331     0.00     1.00   +0.0772     46.38%
A^this                  169573     0.00     1.00   +0.0772     46.38%
A^bought                245782     0.00     1.00   +0.0650     39.06%
A^i                     245469     0.00     1.00   +0.0650     39.06%
A^new                    51974     0.00     1.00   +0.0650     39.06%
A^polo                   48680     0.00     1.00   +0.0650     39.06%
A^shirt                  73882     0.00     1.00   +0.0650     39.06%
A^great                 220692     0.00     1.00   +0.0610     36.64%
A^online                147727     0.00     1.00   +0.0610     36.64%
A^sale                  242707     0.00     1.00   +0.0610     36.64%
A^that                  206586     0.00     1.00   +0.0610     36.64%
A^was                   223274     0.00     1.00   +0.0610     36.64%
A^a^bought              216990     0.00     0.00   +0.0000      0.00%
A^bought^great            7122     0.00     0.00   +0.0000      0.00%
A^great^i               190625     0.00     0.00   +0.0000      0.00%
A^i^is                   76227     0.00     0.00   +0.0000      0.00%
A^is^new                140536     0.00     0.00   +0.0000      0.00%
A^new^online             69117     0.00     0.00   +0.0000      0.00%
A^online^only           173498     0.00     0.00   +0.0000      0.00%
A^only^polo              51059     0.00     0.00   +0.0000      0.00%
A^polo^sale             131483     0.00     0.00   +0.0000      0.00%
A^sale^shirt            191329     0.00     0.00   +0.0000      0.00%
A^shirt^temporary        81555     0.00     0.00   +0.0000      0.00%
A^temporary^test         90632     0.00     0.00   +0.0000      0.00%
A^test^that              13689     0.00     0.00   +0.0000      0.00%
A^that^this             127863     0.00     0.00   +0.0000      0.00%
A^this^was               22011     0.00     0.00   +0.0000      0.00%
Constant                116060     0.00     0.00   +0.1465      0.00%
A^only                   62951     0.00     1.00   -0.0490    -29.47%
A^temporary              44641     0.00     1.00   -0.0490    -29.47%

For example, ^bought^great never actually occurs in any of the original input lines. Am I doing something wrong?

Answer 1

This is a bug in vw-varinfo.

You can verify this by running vw on its own with --invert_hash:

$ vw --ngram 2 test1.txt --invert_hash train.ih

$ grep -F '^bought^great' train.ih
# no output
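
As a cross-check that does not depend on vw at all, the bigrams that really occur in the input can be enumerated directly. This is only a minimal Python sketch, assuming every line has the form "label |Namespace word word ..." with the namespace name attached to the bar, as in the question's data. The helper name bigrams is made up for illustration and is not part of vw.

```python
def bigrams(line):
    """Return the set of adjacent word pairs in one VW-format line."""
    # Everything after the first '|' is the namespace name plus its features.
    _, _, rest = line.partition("|")
    tokens = rest.split()[1:]  # assumption: first token is the namespace ("A")
    return set(zip(tokens, tokens[1:]))

data = [
    "1 |A this is a test",
    "0 |A this test is only temporary",
    "1 |A i bought a new polo shirt",
    "1 |A that was a great online sale",
]

real = set()
for line in data:
    real |= bigrams(line)

print(("bought", "great") in real)  # False: never adjacent in any input line
print(("bought", "a") in real)      # True: "i bought a new polo shirt"
```

Any bigram reported by a tool but absent from this set cannot have come from the data.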

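A quick, partial workaround is to treat every feature with a weight of 0.0 as highly suspicious and probably bogus. Unfortunately, because vw-varinfo knows nothing about --ngram, some real features are also missing from its output.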
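
The zero-weight rule of thumb can be applied mechanically to the vw-varinfo table. The Python sketch below is a rough illustration only: it assumes the six whitespace-separated columns shown in the question's output, and the function name suspicious_features is hypothetical. It flags suspects; it cannot recover the real bigrams that vw-varinfo fails to report.

```python
def suspicious_features(varinfo_text):
    """Return names of features whose reported weight is exactly 0.0."""
    suspects = []
    for row in varinfo_text.splitlines():
        cols = row.split()
        # Skip blank lines, the header, and anything not shaped like a data row.
        if len(cols) != 6 or cols[0] == "FeatureName":
            continue
        name, weight = cols[0], float(cols[4])  # column 5 is Weight
        if weight == 0.0:
            suspects.append(name)
    return suspects

# A few rows copied from the output in the question.
sample = """\
FeatureName            HashVal   MinVal   MaxVal    Weight   RelScore
A^a                     239656     0.00     1.00   +0.1664    100.00%
A^bought^great            7122     0.00     0.00   +0.0000      0.00%
Constant                116060     0.00     0.00   +0.1465      0.00%
A^only                   62951     0.00     1.00   -0.0490    -29.47%
"""

print(suspicious_features(sample))  # ['A^bought^great']
```

Note that Constant is not flagged: its MaxVal is 0.00 but its weight is nonzero, which is why the check uses the Weight column.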

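I really need to rewrite vw-varinfo. vw has changed a lot since vw-varinfo was written, and vw-varinfo was written suboptimally to begin with: it duplicates a lot of the cross-feature logic that already exists in vw itself. The new implementation I have in mind should be more efficient and less susceptible to bugs like this one.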

Because of more urgent matters, this project has been put on hold. Hopefully I will find some time this year to fix it.

Unrelated tip: since you are doing binary classification, you should use labels in {-1,1} rather than {0,1}, together with --loss_function logistic, for best results.
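
Relabeling the existing file is a one-line transformation per example. A small Python sketch, assuming the label is the first space-separated token on each line, as in the question's data (the function name to_signed_label is made up for illustration):

```python
def to_signed_label(line):
    """Rewrite a leading 0 label as -1; leave every other line unchanged."""
    label, sep, rest = line.partition(" ")
    if label == "0":
        label = "-1"
    return label + sep + rest

data = [
    "1 |A this is a test",
    "0 |A this test is only temporary",
]

for line in data:
    print(to_signed_label(line))
# 1 |A this is a test
# -1 |A this test is only temporary
```

The converted lines can then be written to a new file and passed to vw with --loss_function logistic.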

Vowpal Wabbit varinfo and ngrams: non-existent combinations
