Using ruby libsvm for multi-class problems

Date: 2017-04-10 13:47:20

Tags: ruby machine-learning svm libsvm

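For multi-class prediction, following the library example given for this gem returns slightly inaccurate predictions.

The test example ("The teacher yelled at the student who was late to class but later apologized.") should have returned EDUCATION rather than HEALTH.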

require 'libsvm'

# Let's take our documents and create word vectors out of them.
#
documents = [ # 0 is JOKES, 1 is EDUCATION and 2 is HEALTH
            [0, "Why did the chicken cross the road? Because a car was coming"],
            [0, "You're an elevator tech? I bet that job has its ups and downs"],
            [0, "Why did the chicken cross the road? To get the worm"],

            [1, "The university admitted more students this year and dropout rate is lessening."],
            [1, "The students turned in their homework at school before summer break."], 
            [1, "The students and teachers agreed on a plan for study."], 

            [2, "The cold outbreak was bad but not an epidemic."],
            [2, "The doctor and the nurse advised be to get rest because of my cold."],
            [2, "The doctor had to go to the hospital."]
         ]

# Let's create a dictionary of unique words and then we can
# create our vectors.  This is a very simple example.  If you
# were doing this in a production system you'd do things like
# stemming and removing all punctuation (in a less casual way).
#
dictionary = documents.map(&:last).map(&:split).flatten.uniq
dictionary = dictionary.map { |x| x.gsub(/\?|,|\.|\-/,'') }

training_set = []
documents.each do |doc|
  @features_array = dictionary.map { |x| doc.last.include?(x) ? 1 : 0 }
  training_set << [doc.first, Libsvm::Node.features(@features_array)]
end

# Let's set up libsvm so that we can test our prediction
# using the test set
#
problem = Libsvm::Problem.new
parameter = Libsvm::SvmParameter.new

parameter.cache_size = 1 # in megabytes
parameter.eps = 0.001
parameter.c   = 10

# Train classifier using training set
#
problem.set_examples(training_set.map(&:first), training_set.map(&:last))
model = Libsvm::Model.train(problem, parameter)

# Now let's test our classifier using the test set
#
test_set = [1, "The teacher yelled at the student who was late to class but later apologized."]
test_document = test_set.last.split.map{ |x| x.gsub(/\?|,|\.|\-/,'') }

doc_features = dictionary.map{|x| test_document.include?(x) ? 1 : 0 }
pred = model.predict(Libsvm::Node.features(doc_features))
puts pred # returns 2.0 BUT should have been 1.0
result = case pred
    when 0.0 then "predicted #{pred} as joke"
    when 1.0 then "predicted #{pred} as education"
    when 2.0 then "predicted #{pred} as health"
end
puts result

Is there a problem with the code, or do I need to try other kernels and parameters?

1 answer:

Answer 0: (score: 0)

There is no specific problem with the code itself. The cause is simply a lack of training data.

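Try using "The university admitted more students this year and dropout rate is lessening.", which is identical to one of the training instances, as the test instance. The program correctly classifies it as education.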

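Three instances per class are not enough to train an SVM. The best approach is to use more training data and to tune the parameter C with cross-validation.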
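
As a rough illustration of the cross-validation idea, here is a minimal, gem-free sketch in plain Ruby. The helper `k_fold_accuracy` and the `nearest_neighbor` lambda are hypothetical names made up for this example, not part of libsvm. The lambda is only a stand-in classifier that ignores C so the snippet runs on its own; in the real program the block would presumably set `parameter.c = c`, train a model on the fold's training part, and call `model.predict` on each held-out vector.

```ruby
# Hypothetical helper: k-fold cross-validation accuracy.
# The block receives (train_labels, train_features, test_features)
# and must return one predicted label per test feature vector.
def k_fold_accuracy(labels, features, k)
  indices = (0...labels.size).to_a
  # Round-robin assignment of examples to k folds.
  folds = indices.group_by { |i| i % k }.values
  correct = 0
  folds.each do |test_idx|
    train_idx = indices - test_idx
    predictions = yield(
      train_idx.map { |i| labels[i] },
      train_idx.map { |i| features[i] },
      test_idx.map { |i| features[i] }
    )
    correct += test_idx.zip(predictions).count { |i, pred| labels[i] == pred }
  end
  correct.to_f / labels.size
end

# Stand-in classifier (an assumption for this sketch, NOT libsvm):
# 1-nearest-neighbour by Hamming distance. It ignores c; a real run
# would train an SVM with that C value here instead.
nearest_neighbor = lambda do |_c, train_labels, train_features, test_features|
  test_features.map do |x|
    best = train_features.each_index.min_by do |j|
      train_features[j].zip(x).count { |a, b| a != b }
    end
    train_labels[best]
  end
end

# Toy data: two classes with binary feature vectors.
labels   = [0, 0, 0, 1, 1, 1]
features = [[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]]

# Compare candidate values of C by their cross-validated accuracy.
[0.1, 1, 10].each do |c|
  acc = k_fold_accuracy(labels, features, 3) do |tl, tf, xs|
    nearest_neighbor.call(c, tl, tf, xs)
  end
  puts "C=#{c}: accuracy #{acc}"
end
```

The value of C with the highest cross-validated accuracy would then be used to train the final model on the full training set. With only three documents per class, as in the question, each fold leaves very little to train on, which is another reason to collect more data first.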