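Using pyvw, I am implementing a binary classifier along the lines of a spam filter. Most Python implementations are wrappers that create text files and use command-line options, but there is some good documentation here.
I would expect the code to look something like this.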

    import pyvw

    # (label, text) pairs: 1 = spam, -1 = ham
    examples = [
        (1, "This is spam."),
        (-1, "This is ham."),
    ]

    vw = pyvw.vw("--passes 300 --ngram 3 --cache_file test.cache")

    for label, text in examples:
        # using dictionary instead of string " |s This is spam."
        ex = vw.example({"s": text})
        ex.set_label_string(str(label))
        ex.learn()

    test = vw.example(" | This is also spam.")
    test.learn()
    print(test.get_updated_prediction())      # <-- usually 0.0
    print(test.get_simplelabel_prediction())  # <-- the same for every prediction?

    test = vw.example(" | This is certainly ham.")
    test.learn()
    print(test.get_updated_prediction())      # <-- usually 0.0
    print(test.get_simplelabel_prediction())  # <-- the same for every prediction?

This is how I implemented SequenceLabeler:

    class SequenceLabeler(pyvw.SearchTask):
        def __init__(self, vw, sch, num_actions):
            # you must must must initialize the parent class
            # this will automatically store self.sch <- sch, self.vw <- vw
            pyvw.SearchTask.__init__(self, vw, sch, num_actions)

            # set whatever options you want
            sch.set_options(sch.AUTO_HAMMING_LOSS | sch.AUTO_CONDITION_FEATURES)

        def _run(self, sentence):  # it's called _run to remind you that you shouldn't call it directly!
            output = []
            #for n in range(len(sentence)):
            #    pos, word = sentence[n]
            #    use "with...as..." to guarantee that the example is finished properly
            #with self.vw.example({'s': [sentence]}) as ex:
            count = 0
            with self.vw.example({'s': sentence[1]}) as ex:
                # label 0 is not allowed for multiclass. Valid labels are {1,k}
                pred = self.sch.predict(examples=ex, my_tag=count+1, oracle=sentence[0] + 1)
                output.append(pred)
                count += 1
            return output

    sequenceLabeler = vw.init_search_task(SequenceLabeler)
    for i in range(10):
        sequenceLabeler.learn(examples)

What am I missing? The examples in the wiki are good, but they do not cover this particular use case.
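Comments/questions:
1. Do I have to use SequenceLabeler(pyvw.SearchTask)? Otherwise, example() does not seem to have a predict(), whereas sch does. How can I call predict() without sch?
2. In vw.example(), can we use a dictionary of strings such as {"s": "Here is the text of some spam."}, where s is the feature label?
3. If I am in SequenceLabeler, do I assign the class (1 for spam) to the "oracle" parameter of sch.predict()? If I am not in SequenceLabeler, where do I set the class? Via set_label_string()?
4. "# label 0 is not allowed for multiclass. Valid labels are {1,k}": I can easily avoid this error by shifting the labels. Is this a data format problem, or am I missing a parameter? Also, some documentation seems to say I can use 0/1 labels, while other sources suggest -1/1.

Edit: Updates from the mailing list answer #1 and #3 above: the labels must be -1 and 1, and there is no need for search/sequence labeling.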
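
To make the label convention from the edit concrete, here is a minimal sketch in plain Python (no pyvw calls) that maps 0/1 classes to -1/1 and builds input in the string form used above (" |s This is spam."). The helper names to_binary_label, clean_tokens, to_vw_line and to_feature_dict are hypothetical, made up for this illustration. Replacing ":" and "|" with "_" is my own choice for keeping those characters from being read as format syntax, and the dictionary shape (namespace mapped to a list of tokens) is an assumption about what vw.example() accepts.

```python
def to_binary_label(label):
    """Map a 0/1 (or already -1/1) class to the -1/1 labels used for binary classification."""
    if label == 1:
        return 1
    if label in (0, -1):
        return -1
    raise ValueError("expected a binary label, got %r" % (label,))


def clean_tokens(text):
    """Split on whitespace and replace ':' and '|', which are special in VW's text format."""
    return [tok.replace(":", "_").replace("|", "_") for tok in text.split()]


def to_vw_line(label, text, namespace="s"):
    """Build a labeled example string such as '1 |s This is spam.'."""
    return "%d |%s %s" % (to_binary_label(label), namespace, " ".join(clean_tokens(text)))


def to_feature_dict(text, namespace="s"):
    """Build the dictionary form (assumed shape): namespace -> list of tokens."""
    return {namespace: clean_tokens(text)}


if __name__ == "__main__":
    print(to_vw_line(1, "This is spam."))  # 1 |s This is spam.
    print(to_vw_line(0, "This is ham."))   # -1 |s This is ham.
    print(to_feature_dict("This is ham."))
```

The resulting strings are in the same form as the ones passed to vw.example() in the test code above. Whether the dictionary variant behaves identically is something I have not verified.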