Fuzzy matching of words with Ruby

Date: 2014-01-31 22:24:53

Tags: ruby fuzzy-search

I want to match a large set of records against a small list of services.

My data looks like this:

{"title" : "blorb",
"category" : "zurb",
"description" : "Massage is the manipulation of superficial and deeper layers of muscle and connective tissue using various techniques, to enhance function, aid in the healing process, decrease muscle reflex activity..."
}

which I have to match against:

["Swedish Massage", "Haircut"]

Obviously "Swedish Massage" should be the winner, but running a benchmark shows that "Haircut" wins instead:

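require 'amatch'
require 'pp'

# `description` is assumed to hold the "description" string from the record above.
arr = [:levenshtein_similar, :hamming_similar, :pair_distance_similar, :longest_subsequence_similar, :longest_substring_similar, :jaro_similar, :jarowinkler_similar]

arr.each do |method|
  ["Swedish Massage", "Haircut"].each do |sh|
    pp ">>> #{sh} matched with #{method}"
    # Compare the whole service name against the whole description.
    pp sh.send(method, description)
  end
end and nil # `and nil` only suppresses the return value in irb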

Results:

">>> Swedish Massage matched with jaro_similar"
# 0.5246896118183247
">>> Haircut matched with jaro_similar"
# 0.5353606789250354
">>> Swedish Massage matched with jarowinkler_similar"
# 0.5246896118183247
">>> Haircut matched with jarowinkler_similar"
# 0.5353606789250354

The remaining metrics all score well below 0.1.

What is a better way to approach this problem?

1 Answer:

Answer 0: (score: 1)

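Search is a constant battle between precision and recall. One thing you could try is splitting your input into words. This gives a much stronger match on Massage, but at the cost of widening the result set: you will now also find sentences that merely contain a word close to Swedish. You can then try to control that widening by averaging the scores across multiple words, using a stop list to skip common words like "and", looking for tokens that occur near each other, and so on, but you will never see truly perfect results. If you are really interested in fine-tuning, I would suggest ElasticSearch, which is relatively easy to learn and powerful.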
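A minimal sketch of the word-splitting idea, under a few assumptions: it uses a hand-rolled Levenshtein similarity instead of amatch so that it runs without any gem, the stop list is a made-up example, and the description is shortened to its first clause. The helper names (`tokens`, `similarity`, `service_score`) are hypothetical. Each word of the service name is scored against its best-matching word in the description, and those best scores are averaged.

```ruby
# Hypothetical stop list; extend it to suit your data.
STOP_WORDS = %w[a an and the of to in is using].freeze

# Plain Levenshtein edit distance (dynamic programming, row by row).
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Similarity in 0.0..1.0, where 1.0 means the strings are identical.
def similarity(a, b)
  longest = [a.length, b.length].max
  return 1.0 if longest.zero?
  1.0 - levenshtein(a, b).to_f / longest
end

# Lowercased word tokens with stop words removed.
def tokens(text)
  text.downcase.scan(/[a-z]+/).reject { |w| STOP_WORDS.include?(w) }
end

# For each word of the service name, take its best match among the
# description words, then average those best scores.
def service_score(service, description)
  doc = tokens(description)
  words = tokens(service)
  return 0.0 if doc.empty? || words.empty?
  best = words.map { |w| doc.map { |d| similarity(w, d) }.max }
  best.sum / best.size
end

# Shortened version of the description from the question.
description = "Massage is the manipulation of superficial and deeper layers of muscle"
scores = ["Swedish Massage", "Haircut"].map { |s| [s, service_score(s, description)] }.to_h
winner = scores.max_by { |_, v| v }.first
puts winner
```

Because "massage" matches exactly, "Swedish Massage" averages at least 0.5 here, while no description word comes close to "haircut". Note that averaging still penalizes "Swedish" for having no counterpart in the text; whether to average, take the maximum, or weight words differently is exactly the precision/recall trade-off described above.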