Question

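Using pyvw, I am trying to implement a binary classifier similar to a spam filter. Most Python implementations are wrappers that create text files and use command-line options. But there is some good documentation here.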

I would like the code to look something like this.

import pyvw
# assigning to examples[1] on an empty list raises IndexError, so build the list directly
examples = [
    (1, "This is spam."),
    (-1, "This is ham."),
]

vw = pyvw.vw("--passes 300 --ngram 3 --cache_file test.cache ")
for text in examples:
    # using dictionary instead of string " |s This is spam."
    ex = vw.example({"s": text[1]}) 
    ex.set_label_string(str(text[0])) 
    ex.learn()

test = vw.example(" | This is also spam.")
test.learn()
print(test.get_updated_prediction()) #<-- usually 0.0
print(test.get_simplelabel_prediction()) #<-- the same for every prediction?
test = vw.example(" | This is certainly ham.")
test.learn()
print(test.get_updated_prediction()) #<-- usually 0.0
print(test.get_simplelabel_prediction()) #<-- the same for every prediction?

This is how I implemented the SequenceLabeler:

class SequenceLabeler(pyvw.SearchTask):
    def __init__(self, vw, sch, num_actions):
        # you must must must initialize the parent class
        # this will automatically store self.sch <- sch, self.vw <- vw
        pyvw.SearchTask.__init__(self, vw, sch, num_actions)
        # set whatever options you want
        sch.set_options(sch.AUTO_HAMMING_LOSS | sch.AUTO_CONDITION_FEATURES)

    def _run(self, sentence):   # it's called _run to remind you that you shouldn't call it directly!
        output = []
        #for n in range(len(sentence)):
            # pos, word = sentence[n]
            # use "with...as..." to guarantee that the example is finished properly
            #with self.vw.example({'s': [sentence]}) as ex:
        count = 0
        with self.vw.example({'s': sentence[1]}) as ex:
            # label 0 is not allowed for multiclass.  Valid labels are {1,k}
            pred = self.sch.predict(examples=ex, my_tag=count+1, oracle=sentence[0] + 1)
            output.append(pred)
            count += 1
        return output

# module level, not inside the class body
sequenceLabeler = vw.init_search_task(SequenceLabeler)
for i in range(10):
    sequenceLabeler.learn(examples)

What am I missing? The examples in the wiki are good, but they don't cover this particular use case.

Notes/questions:

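1. Is it necessary to create a SequenceLabeler(pyvw.SearchTask)? Otherwise, example() doesn't seem to have predict(), while sch does. How do I get a prediction without sch.predict()?
2. When building vw.example(), can we use a dictionary of strings such as {"s": "Here is the text of some spam."}, where s is the feature label? If I am in SequenceLabeler, do I assign the class (1 for spam) to the "oracle" parameter of sch.predict()? If I am not in SequenceLabeler, where do I set the class? Via set_label_string()?
3. How do I explicitly tell vw that this is binary classification? Should the labels for a binary classification problem be 0,1 or -1,1? I get the error "# label 0 is not allowed for multiclass. Valid labels are {1,k}". I can easily avoid this error by shifting the labels. Is this a data format issue, or am I missing a parameter? Also, some of the documentation seems to say I can use 0,1, while other sources suggest -1,1 labels.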
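For comparison with the dictionary form, the string form mentioned in the code comment (" |s This is spam.") could be built from the (label, text) tuples with a small helper. This is only a sketch: to_vw_line is a hypothetical name, not part of pyvw, and it assumes the usual Vowpal Wabbit text format, in which '|' opens a namespace and ':' separates a feature from its value, so both characters are replaced in the text.

```python
def to_vw_line(label, text, namespace="s"):
    """Format one (label, text) pair as a Vowpal Wabbit text-format line.

    Hypothetical helper for illustration; it is not part of pyvw.
    Pass label=None to build an unlabeled (test) example.
    """
    # '|' and ':' are special in the VW text format, so replace them with spaces
    cleaned = text.replace("|", " ").replace(":", " ")
    tokens = cleaned.split()
    prefix = "" if label is None else str(label)
    return "{} |{} {}".format(prefix, namespace, " ".join(tokens))


print(to_vw_line(1, "This is spam."))
print(to_vw_line(None, "This is also spam."))
```

Whether the string form or the dictionary form is preferable is left open here; the helper only shows what the string would look like.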

Edit: Updates from the mailing list answer #1 and #3 above: the labels must be -1,1, and search/sequence labeling is not needed.
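Since the labels have to be -1 and 1, data that arrives with 0/1 classes would need remapping before it is turned into examples. A minimal sketch in plain Python, assuming 0 is the negative class; to_binary_label is a hypothetical helper and not part of pyvw:

```python
def to_binary_label(label):
    """Map a 0/1 or -1/1 class label to the -1/1 convention.

    Hypothetical helper for illustration; assumes 0 (or -1) is the negative class.
    """
    if label in (1, "1"):
        return 1
    if label in (0, -1, "0", "-1"):
        return -1
    raise ValueError("unexpected label: {!r}".format(label))


# same shape as the (label, text) tuples in the question, but with 0/1 classes
raw_examples = [(1, "This is spam."), (0, "This is ham.")]
relabeled = [(to_binary_label(y), text) for y, text in raw_examples]
print(relabeled)
```

Raising on anything outside the expected labels means a stray multiclass label fails at this step instead of being passed through unchanged.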

Vowpal Wabbit python library for binary classification of text

0 answers: