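For multi-class prediction, following the library example given for this gem returns a slightly inaccurate prediction.
The test sentence ("The teacher yelled at the student who was late to class but later apologized.") should have been classified as EDUCATION,
not as HEALTH.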
require 'libsvm'
# Let's take our documents and create word vectors out of them.
#
documents = [ # 0 is JOKES, 1 is EDUCATION and 2 is HEALTH
[0, "Why did the chicken cross the road? Because a car was coming"],
[0, "You're an elevator tech? I bet that job has its ups and downs"],
[0, "Why did the chicken cross the road? To get the worm"],
[1, "The university admitted more students this year and dropout rate is lessening."],
[1, "The students turned in their homework at school before summer break."],
[1, "The students and teachers agreed on a plan for study."],
[2, "The cold outbreak was bad but not an epidemic."],
[2, "The doctor and the nurse advised be to get rest because of my cold."],
[2, "The doctor had to go to the hospital."]
]
# Let's create a dictionary of unique words and then we can
# create our vectors. This is a very simple example. If you
# were doing this in a production system you'd do things like
# stemming and removing all punctuation (in a less casual way).
#
dictionary = documents.map(&:last).map(&:split).flatten.uniq
dictionary = dictionary.map { |x| x.gsub(/\?|,|\.|\-/,'') }
training_set = []
documents.each do |doc|
  # 1 if the dictionary word occurs in the document text, 0 otherwise
  features_array = dictionary.map { |x| doc.last.include?(x) ? 1 : 0 }
  training_set << [doc.first, Libsvm::Node.features(features_array)]
end
# Let's set up libsvm so that we can test our prediction
# using the test set
#
problem = Libsvm::Problem.new
parameter = Libsvm::SvmParameter.new
parameter.cache_size = 1 # in megabytes
parameter.eps = 0.001
parameter.c = 10
# Train classifier using training set
#
problem.set_examples(training_set.map(&:first), training_set.map(&:last))
model = Libsvm::Model.train(problem, parameter)
# Now let's test our classifier using the test set
#
test_set = [1, "The teacher yelled at the student who was late to class but later apologized."]
test_document = test_set.last.split.map{ |x| x.gsub(/\?|,|\.|\-/,'') }
doc_features = dictionary.map{|x| test_document.include?(x) ? 1 : 0 }
pred = model.predict(Libsvm::Node.features(doc_features))
puts pred # returns 2.0 BUT should have been 1.0
result = case pred
         when 0.0 then "predicted #{pred} as joke"
         when 1.0 then "predicted #{pred} as education"
         when 2.0 then "predicted #{pred} as health"
         end
puts result
Is there a problem with the code, or do I need to try other kernels and parameters?
Answer 0 (score: 0)
There is no specific problem with the code itself. The reason is simply a lack of training data.
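One way to see how thin the training data is: list the words that the test sentence shares with each class. The sketch below is only an illustration. It uses a simplified lowercase tokenizer, which is an assumption and not the exact feature extraction from the question, and the names `tokenize` and `shared_words_by_class` are hypothetical helpers.

```ruby
require 'set'

# Training sentences copied from the question (0 = JOKES, 1 = EDUCATION, 2 = HEALTH).
DOCUMENTS = [
  [0, "Why did the chicken cross the road? Because a car was coming"],
  [0, "You're an elevator tech? I bet that job has its ups and downs"],
  [0, "Why did the chicken cross the road? To get the worm"],
  [1, "The university admitted more students this year and dropout rate is lessening."],
  [1, "The students turned in their homework at school before summer break."],
  [1, "The students and teachers agreed on a plan for study."],
  [2, "The cold outbreak was bad but not an epidemic."],
  [2, "The doctor and the nurse advised be to get rest because of my cold."],
  [2, "The doctor had to go to the hospital."]
].freeze

LABELS = { 0 => "JOKES", 1 => "EDUCATION", 2 => "HEALTH" }.freeze

TEST_SENTENCE = "The teacher yelled at the student who was late to class but later apologized."

# Simplified tokenizer (an assumption): lowercase, keep letters and apostrophes only.
def tokenize(text)
  text.downcase.scan(/[a-z']+/)
end

# For each class label, return the sorted list of words that occur both in
# that class's training sentences and in the given sentence.
def shared_words_by_class(documents, sentence)
  test_tokens = tokenize(sentence).to_set
  documents.group_by(&:first).transform_values do |docs|
    class_tokens = docs.flat_map { |_, text| tokenize(text) }.to_set
    (class_tokens & test_tokens).to_a.sort
  end
end

shared_words_by_class(DOCUMENTS, TEST_SENTENCE).each do |label, words|
  puts "#{LABELS[label]}: #{words.join(', ')}"
end
```

With this tokenizer the overlap consists only of function words such as "the", "to", "was" and "but", and the HEALTH sentences happen to share more of them than the EDUCATION sentences do. Without stemming, "student" and "teacher" do not match "students" and "teachers", so the test sentence carries no topical signal that the model has seen.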
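Try using "The university admitted more students this year and dropout rate is lessening.", which is identical to one of the training instances, as the test instance. The program correctly classifies it as EDUCATION.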
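Three training instances per class are not enough for an SVM. The best approach is to use more training data and to tune the parameter C with cross-validation.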
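A minimal sketch of picking C by k-fold cross-validation, written in plain Ruby so it does not depend on any particular gem API. The helper names `k_folds`, `cv_accuracy` and `best_c` are hypothetical, and `trainer` is an assumed callable that you would supply yourself. The commented wiring at the bottom reuses only the calls that already appear in the question's code.

```ruby
# Split the indices 0...n into k folds round-robin, so that every example
# is held out exactly once.
def k_folds(n, k)
  (0...n).group_by { |i| i % k }.values
end

# Accuracy over all held-out examples for one candidate value of C.
# `examples` is an array of [label, features] pairs, like training_set in
# the question. `trainer` (hypothetical) is called as trainer.call(train, c)
# and must return a callable that maps features to a predicted label.
def cv_accuracy(examples, k, c, trainer)
  correct = 0
  k_folds(examples.size, k).each do |held_out|
    train = examples.each_with_index
                    .reject { |_, i| held_out.include?(i) }
                    .map(&:first)
    predict = trainer.call(train, c)
    correct += held_out.count { |i| predict.call(examples[i].last) == examples[i].first }
  end
  correct.to_f / examples.size
end

# Return the candidate C with the highest cross-validated accuracy.
def best_c(examples, k, candidates, trainer)
  candidates.max_by { |c| cv_accuracy(examples, k, c, trainer) }
end

# Possible wiring for the question's setup (untested assumption, using only
# calls shown in the question):
#
#   svm_trainer = lambda do |train, c|
#     problem = Libsvm::Problem.new
#     parameter = Libsvm::SvmParameter.new
#     parameter.cache_size = 1
#     parameter.eps = 0.001
#     parameter.c = c
#     problem.set_examples(train.map(&:first), train.map(&:last))
#     model = Libsvm::Model.train(problem, parameter)
#     ->(features) { model.predict(features) }
#   end
#   best_c(training_set, 3, [0.1, 1, 10, 100], svm_trainer)
```

Because the question's documents are ordered by class with three per class, the round-robin split with k = 3 puts one document of each class in every fold. With only nine documents the estimate is still very noisy, so collecting more data remains the first step.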