I want to match a large set of records against a small number of services.
My data looks like this:
{"title" : "blorb",
"category" : "zurb",
"description" : "Massage is the manipulation of superficial and deeper layers of muscle and connective tissue using various techniques, to enhance function, aid in the healing process, decrease muscle reflex activity..."
}
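I have to match it against ["Swedish Massage", "Haircut"].
Obviously "Swedish Massage" should be the winner, but running a benchmark shows that "Haircut" comes out ahead: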
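require 'amatch'
# Assumption: the *_similar methods are mixed into String via Amatch::StringMethods.
class String; include Amatch::StringMethods; end
# Assumption: `data` is the parsed record shown above.
description = data["description"]

arr = [:levenshtein_similar, :hamming_similar, :pair_distance_similar, :longest_subsequence_similar, :longest_substring_similar, :jaro_similar, :jarowinkler_similar]
arr.each do |method|
  ["Swedish Massage", "Haircut"].each do |sh|
    pp ">>> #{sh} matched with #{method}"
    pp sh.send(method, description)
  end
end and nil # `and nil` only suppresses the echoed return value in irb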
Results:
">>> Swedish Massage matched with jaro_similar"
# 0.5246896118183247
">>> Haircut matched with jaro_similar"
# 0.5353606789250354
">>> Swedish Massage matched with jarowinkler_similar"
# 0.5246896118183247
">>> Haircut matched with jarowinkler_similar"
# 0.5353606789250354
The remaining measures all score far below 0.1.
What would be a better way to approach this problem?
Answer 0 (score: 1)
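Search is a constant battle between precision and recall. One thing you could try is splitting your input into words - this gives a much stronger match on Massage, but at the cost of broadening the result set: you will now also get back sentences that merely contain words close to Swedish. You can then try to rein in that broadening by averaging the scores over multiple words, by using a stop list to skip common words like and, by using heuristics that look for tokens close to each other, and so on, but you will never really see perfect results. If you are really interested in fine-tuning this, I would suggest ElasticSearch - it is relatively easy to learn and powerful.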
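
As a rough illustration of the word-splitting idea, here is a minimal sketch. It is not the amatch API: the helper names (`levenshtein`, `similarity`, `tokens`, `service_score`) and the tiny `STOP_WORDS` list are hypothetical, and a hand-written Levenshtein similarity stands in for amatch's measures so the snippet runs without any gem. Each word of the service name is scored against its best-matching word in the description, and those best scores are averaged.

```ruby
# Hypothetical stop list; a real one would be much longer.
STOP_WORDS = %w[a an and the of to is in using].freeze

# Classic dynamic-programming Levenshtein edit distance (row by row).
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Similarity in 0.0..1.0, where 1.0 means the strings are identical.
def similarity(a, b)
  longest = [a.length, b.length].max
  return 1.0 if longest.zero?
  1.0 - levenshtein(a, b).to_f / longest
end

# Lowercase, keep runs of ASCII letters only, drop stop words.
def tokens(text)
  text.downcase.scan(/[a-z]+/) - STOP_WORDS
end

# For each word of the service name, take its best match among the
# description tokens, then average those best scores.
def service_score(service, description)
  desc_tokens = tokens(description)
  svc_tokens = tokens(service)
  return 0.0 if desc_tokens.empty? || svc_tokens.empty?
  best = svc_tokens.map { |w| desc_tokens.map { |t| similarity(w, t) }.max }
  best.sum / best.size
end

description = "Massage is the manipulation of superficial and deeper layers of muscle and connective tissue using various techniques, to enhance function, aid in the healing process, decrease muscle reflex activity..."

scores = ["Swedish Massage", "Haircut"].map { |s| [s, service_score(s, description)] }.to_h
best_service = scores.max_by { |_, score| score }.first
puts best_service
```

With the description from the question this picks "Swedish Massage", because the exact hit on massage dominates the average. The trade-off described above still applies: any description containing a word close to Swedish or Massage gets a boost, so a threshold or a proper search engine is still needed for larger data sets.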