Vowpal Wabbit Python library for binary text classification

Posted: 2015-09-19 05:49:27

Tags: python machine-learning vowpalwabbit

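Using pyvw, I am implementing a binary classifier along the lines of a spam filter. Most Python implementations are wrappers that write text files and call the command-line tool with options. There is, however, some good documentation here.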

I would like the code to look something like this.

import pyvw
# (label, text) pairs; assigning to indexes of an empty list would raise IndexError
examples = [
    (1, "This is spam."),
    (-1, "This is ham."),
]

vw = pyvw.vw("--passes 300 --ngram 3 --cache_file test.cache ")
for text in examples:
    # using dictionary instead of string " |s This is spam."
    ex = vw.example({"s": text[1]}) 
    ex.set_label_string(str(text[0])) 
    ex.learn()

test = vw.example(" | This is also spam.")
test.learn() 
print test.get_updated_prediction() #<-- usually 0.0
print test.get_simplelabel_prediction() #<-- the same for every prediction?
test = vw.example(" | This is certainly ham.")
test.learn() 
print test.get_updated_prediction() #<-- usually 0.0
print test.get_simplelabel_prediction() #<-- the same for every prediction? 
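
For reading the numbers printed above, here is a minimal sketch that assumes -1/1 labels and a linear model, in which case the prediction is a real-valued score that still has to be thresholded into a class. The helper names `score_to_label` and `score_to_probability` are hypothetical and not part of pyvw, and the sigmoid mapping is only meaningful under the further assumption that training used logistic loss (for example `--loss_function logistic`).

```python
import math


def score_to_label(score, threshold=0.0):
    """Map a raw real-valued score to a -1/1 class label (hypothetical helper)."""
    return 1 if score > threshold else -1


def score_to_probability(score):
    """Sigmoid of a raw score; assumes the model was trained with logistic loss."""
    # Branch on the sign so math.exp never receives a large positive argument.
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    e = math.exp(score)
    return e / (1.0 + e)


# Illustrative scores only; these are not outputs of a trained model.
for raw in (-2.0, 0.0, 1.5):
    print(raw, score_to_label(raw), round(score_to_probability(raw), 3))
```

A score of exactly 0.0 sits on the decision boundary, so a prediction that is "usually 0.0" suggests the model has learned nothing for those features rather than that it favors one class.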

This is how I implemented SequenceLabeler:

class SequenceLabeler(pyvw.SearchTask):
    def __init__(self, vw, sch, num_actions):
        # you must must must initialize the parent class
        # this will automatically store self.sch <- sch, self.vw <- vw
        pyvw.SearchTask.__init__(self, vw, sch, num_actions)
        # set whatever options you want
        sch.set_options(sch.AUTO_HAMMING_LOSS | sch.AUTO_CONDITION_FEATURES)

    def _run(self, sentence):   # it's called _run to remind you that you shouldn't call it directly!
        output = []
        #for n in range(len(sentence)):
            # pos, word = sentence[n]
            # use "with...as..." to guarantee that the example is finished properly
            #with self.vw.example({'s': [sentence]}) as ex:
        # dedented to the method body: the for loop above is commented out
        count = 0
        with self.vw.example({'s': sentence[1]}) as ex:
            # label 0 is not allowed for multiclass.  Valid labels are {1,k}
            pred = self.sch.predict(examples=ex, my_tag=count+1, oracle=sentence[0] + 1)
            output.append(pred)
            count += 1
        return output

# module level, not inside the class body
sequenceLabeler = vw.init_search_task(SequenceLabeler)
for i in xrange(10):
    sequenceLabeler.learn(examples)

What am I missing? The examples in the wiki are good, but they do not cover this particular use case.

Notes/questions:

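  1. Is it necessary to create a SequenceLabeler(pyvw.SearchTask)? Without it, example() does not seem to have a predict(), whereas sch does. How can I call predict() without sch?
  2. When constructing vw.example(), can we use a dictionary of strings such as {"s": "Here is the text of some spam."}, where s is the feature namespace? If I am in SequenceLabeler, do I assign the class (1 for spam) to the "oracle" parameter of sch.predict()? If I am not in SequenceLabeler, where do I set the class? Via set_label_string()?
  3. How do I explicitly set up vw so that it understands this is binary classification? Should the labels for a binary classification problem be 0,1 or -1,1? I get the error "# label 0 is not allowed for multiclass. Valid labels are {1,k}". I can easily avoid this error by shifting the labels. Is this a data format issue, or am I missing a parameter? Also, some documentation seems to say I can use 0,1, while other sources suggest -1,1 labels.
  4. Edit: Updates from the mailing list answer #1 and #3 above: the labels must be -1,1, and no search/sequence labeler is needed.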
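
Following the mailing-list update in item 4, the label convention can be sketched as below. This is only an illustration: `to_vw_label` and `to_vw_line` are hypothetical helper names, not pyvw functions. The output uses the same `label |namespace features` text form as the strings passed to vw.example() above, and it replaces `:` and `|` in the text because those characters are special in that format.

```python
def to_vw_label(label):
    """Map a 0/1 (or already -1/1) class label to the -1/1 convention."""
    if label in (1, -1):
        return label
    if label == 0:
        return -1
    raise ValueError("expected a label in {-1, 0, 1}, got %r" % (label,))


def to_vw_line(label, text, namespace="s"):
    """Format one example as 'label |namespace tokens' (hypothetical helper)."""
    # ':' and '|' are special in the VW text format, so replace them with spaces.
    cleaned = text.replace(":", " ").replace("|", " ")
    # Collapse runs of whitespace so each token is separated by a single space.
    tokens = " ".join(cleaned.split())
    return "%d |%s %s" % (to_vw_label(label), namespace, tokens)


# The second example deliberately uses 0 to show the 0 -> -1 mapping.
examples = [(1, "This is spam."), (0, "This is ham.")]
lines = [to_vw_line(label, text) for label, text in examples]
print(lines)
```

Mapping the labels up front would avoid shifting them by one later, which is what triggers or hides the "label 0 is not allowed for multiclass" error described in item 3.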

0 answers:

No answers yet