Question

I have a database of items with tags, for example:

item1 is tagged "pork with apple sauce",
item2 is tagged "pork",
item3 is tagged "apple sauce".

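If I match the string:

"Today I would like to eat pork with apple sauce, it would fill me up"

against the tags, I get three results. However, I only want the most specific one, which in this case is item1.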

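This is just an example, and I'm not using a particular database, just strings and maps in Ruby. I came up with "fuzzy search", but I'm not sure that is the right term. Can anyone suggest how to solve this particular problem?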

Answer 1

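Yes, you need fuzzy matching (a.k.a. approximate matching). It is a well-known problem, and implementing an approximate matching algorithm by hand is not an easy task (but I'm sure it's a lot of fun! =D). Many things can affect how "similar" two strings A and B are, depending on what you consider important: how many times A appears in B, how closely the order and distance of the words in A match the way they appear in B, whether the "important" words in A appear in B, and so on.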
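To make those criteria a bit more concrete, here is a minimal sketch that turns two of them into numbers: how often A occurs in B as a whole phrase, and how much of A shows up in B in the same word order. The helper names (`words`, `occurrences`, `ordered_coverage`) and the simple tokenizer are assumptions made for illustration; they do not come from any library.

```ruby
# Hypothetical helpers illustrating two of the similarity signals listed above.

# Split a string into lowercase words, dropping punctuation.
def words(text)
  text.downcase.scan(/[a-z0-9']+/)
end

# Signal 1: how many times phrase A occurs in B as a run of whole words.
def occurrences(a, b)
  needle = words(a)
  return 0 if needle.empty?

  words(b).each_cons(needle.size).count { |window| window == needle }
end

# Signal 2: fraction of A's words, counted from the start of A, that can be
# found in B in the same relative order (greedy left-to-right scan).
def ordered_coverage(a, b)
  needle = words(a)
  return 0.0 if needle.empty?

  matched = 0
  words(b).each do |word|
    matched += 1 if matched < needle.size && word == needle[matched]
  end
  matched.to_f / needle.size
end
```

How such signals are weighted against each other is exactly the "what you consider important" part, so this is only a starting point.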

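If you can use an existing library, there seem to be a couple of Ruby gems that can do the job. For example, one called fuzzy-string-match uses the Jaro-Winkler distance, ported from Lucene (a Java library... it also seems to have kept the Java convention of camelCased method names ¬¬):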

require 'fuzzystringmatch'

matcher = FuzzyStringMatch::JaroWinkler.create(:pure)

tags = ["pork with apple sauce", "pork", "apple sauce"]
input = "Today I would like to eat pork with apple sauce, it would fill me up"

# Select the tag by distance to the input string
# (distance == 1 means a perfect match)
best_tag = tags.max_by { |tag| matcher.getDistance(tag, input) }

p best_tag

This will correctly select "pork with apple sauce".

There is also another gem called amatch that has many other approximate matching algorithms.

Answer 2

Depending on your specific use case, you might not need fuzzy search at all.

Perhaps a very basic implementation like this is enough for you:

class Search
  attr_reader :items, :string

  def initialize(items, string)
    @items  = items
    @string = string.downcase
  end

  def best_match
    items.max_by { |item| rate(item) }
  end

  private

  def rate(item)
    tag_list(item).count { |tag| string.include?(tag) }
  end

  def tag_list(item)
    # Downcase the tags too, since the input string is downcased in initialize.
    item[:tags].downcase.split(" ")
  end
end

items = [
  { id: :item1, tags: "pork with apple sauce" },
  { id: :item2, tags: "pork" },
  { id: :item3, tags: "apple sauce" }
]

string = "Today I would like to eat pork with apple sauce, it would fill me up"

Search.new(items, string).best_match
#=> {:id=>:item1, :tags=>"pork with apple sauce"}

Answer 3

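Before matching the database against the string, decide on an ordering, or specificity, for the items in the database. You did not make it clear in the question, but I guess what you have in mind is length. So, suppose you have the data as a hash: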

h = {
  item1: "pork with apple sauce",
  item2: "pork",
  item3: "apple sauce",
}

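Then you can sort it by tag length so that longer tags come first in the list. At the same time, you can convert the tags into regular expressions so that you don't have to worry about variations in whitespace. You then get an array like this: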

a =
h
.sort_by{|_, s| s.length}.reverse
.map{|k, s| [k, Regexp.new("\\b#{s.gsub(/\s+/, '\\s+')}\\b")]}
# =>
# [
#   [
#     :item1,
#     /\bpork\s+with\s+apple\s+sauce\b/
#   ],
#   [
#     :item3,
#     /\bapple\s+sauce\b/
#   ],
#   [
#     :item2,
#     /\bpork\b/
#   ]
# ]

Once you have this, you just need to find the first item in the list that matches the string.

s = "Today I would like to eat pork with apple sauce, it would fill me up"

a.find{|_, r| s =~ r}[0]
# => :item1
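The two steps above can be folded into one helper. This is a minimal sketch under a few assumptions that are not in the answer: the method name `most_specific_tag` is hypothetical, matching is made case-insensitive, each tag word is passed through `Regexp.escape`, and nil is returned when no tag matches (the `find{...}[0]` form above would raise on nil in that case).

```ruby
# Hypothetical helper: returns the key of the longest tag found in the text,
# or nil when no tag matches.
def most_specific_tag(tags, text)
  tags
    .sort_by { |_, tag| -tag.length } # longest (most specific) tag first
    .each do |key, tag|
      # Escape each word so regex metacharacters in a tag are taken literally,
      # and allow any amount of whitespace between words.
      pattern = tag.split.map { |word| Regexp.escape(word) }.join('\s+')
      return key if text.match?(/\b#{pattern}\b/i)
    end
  nil
end

h = {
  item1: "pork with apple sauce",
  item2: "pork",
  item3: "apple sauce",
}

s = "Today I would like to eat pork with apple sauce, it would fill me up"

most_specific_tag(h, s)
# => :item1
```

Note that the `\b` anchors assume tags start and end with word characters; a tag ending in punctuation would need a different boundary check.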

Answer 4

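This applies to programming in general rather than to Ruby specifically.

I would tokenize both strings, i.e. the needle and the haystack, then loop through them while counting the occurrences. Finally, compare the scores.

Some pseudo code: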

needle[] = array of tokens from keysentence
haystack[] = array of tokens from search string
int score = 0

do {
  haystackToken = haystack's next token
  restart needle at its first token

  do {
    needleToken = needle's next token

    if (haystackToken equals needleToken)
      score += 1

  } while (needle has more tokens)

} while (haystack has more tokens)
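Since the question is about Ruby, here is one way the pseudo code might look there. It is only a sketch: the `tokenize` and `score` method names and the `\w+` tokenizer are assumptions, and the final comparison of scores simply picks the tag with the highest count.

```ruby
# Split text into lowercase word tokens (this tokenizer is an assumption).
def tokenize(text)
  text.downcase.scan(/\w+/)
end

# Count how many (haystack token, needle token) pairs are equal,
# mirroring the nested loops in the pseudo code.
def score(needle, haystack)
  needle_tokens = tokenize(needle)
  tokenize(haystack).sum do |haystack_token|
    needle_tokens.count { |needle_token| needle_token == haystack_token }
  end
end

tags  = ["pork with apple sauce", "pork", "apple sauce"]
input = "Today I would like to eat pork with apple sauce, it would fill me up"

# Score every tag against the input, then keep the highest-scoring one.
scores = tags.to_h { |tag| [tag, score(tag, input)] }
best_tag = scores.max_by { |_, tag_score| tag_score }.first
# best_tag is "pork with apple sauce" (scores: 4, 1 and 2)
```

Because repeated words are counted every time they occur, a long input that repeats a common word can inflate a tag's score; deduplicating the haystack tokens first is one possible adjustment.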

How to match a longer string against shorter words or strings
