Question

I'm reading about the suffix array approach to finding a substring in a string (see http://www.codeodor.com/index.cfm/2007/12/24/The-Suffix-Array/1845), for example:

sa = SuffixArray.new("abracadabra")
puts sa.find_substring("aca")

where SuffixArray is an implementation of a suffix array and find_substring is a method that returns the position where the substring starts.
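For readers without the linked article at hand, here is a minimal sketch of what such a class might look like. It is an assumption, not the article's code: it borrows the @suffix_array layout (hashes with :suffix and :position keys) that the code in Answer 2 relies on, and it stores every suffix as a full string, which is only reasonable for short inputs.

```ruby
# Hypothetical minimal SuffixArray; the class in the linked article may differ.
class SuffixArray
  def initialize(text)
    # One entry per suffix, sorted lexicographically by the suffix text.
    @suffix_array = (0...text.length).map { |i|
      { suffix: text[i..-1], position: i }
    }.sort_by { |entry| entry[:suffix] }
  end

  # Returns the start position of one exact occurrence, or nil if none.
  def find_substring(the_substring)
    low = 0
    high = @suffix_array.length - 1
    while low <= high
      mid = (low + high) / 2
      # Compare only the first the_substring.length characters of the suffix.
      prefix = @suffix_array[mid][:suffix][0, the_substring.length]
      return @suffix_array[mid][:position] if prefix == the_substring

      if prefix > the_substring
        high = mid - 1
      else
        low = mid + 1
      end
    end
    nil
  end
end

sa = SuffixArray.new("abracadabra")
puts sa.find_substring("aca") # prints 3
```

Because the suffixes are sorted, their fixed-length prefixes are sorted too, which is what makes the binary search valid for exact matches.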

My question is: how can I implement this search while allowing a given number of mismatches in the substring? For example,

max_mismatches = 2
search_string = "abrazadabra"
substring = "aca"

# pass the variables themselves, not string literals containing their names
sa = SuffixArray.new(search_string)
puts sa.find_substring(substring, max_mismatches)

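Here mismatches can be thought of as an error threshold. In this case the search should be able to match "aza" and return the starting position of the "aza" substring. Also note that "abr" has 2 mismatches, so it should be returned first. Ideally, the method should return all possible occurrences.

Any ideas? Or other approaches to this kind of problem? Thanks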

Answer 1

# checks whether two strings are similar,
# allowing given number of characters of difference
def similar? a, b, mismatches = 1
  a.chars.zip(b.chars).count{|ca, cb| ca != cb} <= mismatches
end

# in haystack, find similar strings to needle
def find_similar haystack, needle, mismatches = 1
  haystack.chars.each_cons(needle.length).map(&:join).select{|s|
    similar?(s, needle, mismatches)
  }
end

find_similar 'abracadabra', 'aca'
# => ["aca", "ada"] 
find_similar 'abracadabra', 'aca', 2
# => ["abr", "bra", "aca", "ada", "abr", "bra"]

Feel free to change the similar? method to match your own definition of similarity.
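Since the question asks for starting positions rather than the matched strings, the same brute-force idea can be adapted to return indices. This is only a sketch; find_similar_positions is a hypothetical name, not part of the answer above.

```ruby
# Hypothetical helper: like find_similar, but returns the start positions
# of the matching windows instead of the windows themselves.
def find_similar_positions(haystack, needle, mismatches = 1)
  return [] if needle.empty? || needle.length > haystack.length

  (0..haystack.length - needle.length).select do |start|
    window = haystack[start, needle.length]
    # Count the positions at which the window and the needle differ.
    window.chars.zip(needle.chars).count { |cw, cn| cw != cn } <= mismatches
  end
end

p find_similar_positions('abrazadabra', 'aca')    # => [3, 5]
p find_similar_positions('abrazadabra', 'aca', 2) # => [0, 1, 3, 5, 7, 8]
```

This checks every window, so the cost grows with the haystack length times the needle length; it needs no preprocessing, but it gets none of the benefit of a suffix array either.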

Answer 2

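What we're calling mismatches is known as the Hamming distance, which is simply a count of the characters that differ between two strings (only substitutions are allowed, not insertions or deletions).

So Mladen's counting code can be used inside the find_substring function to determine whether a string is within the allowed number of mismatches.

If it is, you can return it (or add it to a list of matches if you want to keep track of all of them). After that check, you test whether to set high or low, depending on whether the comparison is greater or less than the substring.

Here's how I changed the code: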

def find_substring(the_substring, n_mismatches)
  # uses typical binary search
  high = @suffix_array.length - 1
  low = 0
  while low <= high
    mid = (high + low) / 2
    this_suffix = @suffix_array[mid][:suffix]
    compare_len = the_substring.length - 1
    comparison = this_suffix[0..compare_len]

    if n_mismatches == 0
      within_n_mismatches = comparison == the_substring
    else
      within_n_mismatches = hamming_distance(the_substring, comparison) <= n_mismatches
    end

    return @suffix_array[mid][:position] if within_n_mismatches

    if comparison > the_substring
      high = mid - 1
    else
      low = mid + 1
    end
  end
  return nil
end

def hamming_distance(a, b)
  # from Mladen Jablanović's answer at http://stackoverflow.com/questions/5322428/finding-a-substring-while-allowing-for-mismatches-with-ruby
  a.chars.zip(b.chars).count { |ca, cb| ca != cb }
end

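It adds some processing time, linear in the size of the substring, but I don't think that amounts to much given the size of the rest of the data. I haven't thought that part through as well as I should have, but you may want to test it against another approach: making the changes to your input string and searching multiple times.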

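For example, if you're working with DNA and your substring is "GAC", you'd search for that, plus "AAC", "CAC" and "TAC" (and then the combinations for the 2nd and 3rd nucleotides). The number of possibilities should stay small enough to fit in memory.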
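The variant-enumeration idea can be sketched as follows. The helper names (mismatch_variants, exact_positions, find_with_variants) are hypothetical, and exact_positions is a plain scan standing in for an exact suffix array lookup so that the example runs on its own.

```ruby
require 'set'

# Hypothetical helper: every string within max_mismatches substitutions
# of pattern, over the given alphabet (DNA by default).
def mismatch_variants(pattern, max_mismatches, alphabet = %w[A C G T])
  variants = Set.new([pattern])
  frontier = [pattern]
  max_mismatches.times do
    next_frontier = []
    frontier.each do |s|
      s.each_char.with_index do |ch, i|
        alphabet.each do |sub|
          next if sub == ch
          candidate = s.dup
          candidate[i] = sub
          # Set#add? returns nil when the candidate was already seen.
          next_frontier << candidate if variants.add?(candidate)
        end
      end
    end
    frontier = next_frontier
  end
  variants.to_a
end

# Stand-in for an exact lookup (for instance in a suffix array).
def exact_positions(text, pattern)
  (0..text.length - pattern.length).select { |i| text[i, pattern.length] == pattern }
end

def find_with_variants(text, pattern, max_mismatches, alphabet = %w[A C G T])
  mismatch_variants(pattern, max_mismatches, alphabet)
    .flat_map { |variant| exact_positions(text, variant) }
    .uniq.sort
end

p mismatch_variants("GAC", 1).length         # => 10
p find_with_variants("TTGACCAACG", "GAC", 1) # => [2, 6]
```

For a pattern of length m over an alphabet of size s, the number of variants within n mismatches is the sum over k = 0..n of C(m, k) * (s - 1)^k, so it stays small for short patterns and small n (10 variants for "GAC" with one mismatch, 37 with two) but grows quickly beyond that.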

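The converse, storing all the mismatch possibilities in the suffix array itself, doesn't hold up. The array is probably large already, so multiplying its size several times over would quickly make it too large to fit in memory.

I've used that approach before, not quite with a suffix array, but just storing the mismatches.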

Aside from the code above, I've modified it a bit to add a function that gets all matches. I posted it to one of my repositories at github.
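The repository code isn't reproduced here, so what follows is only a hypothetical sketch of what an "all matches" function could look like, not the author's implementation. It assumes the same @suffix_array layout as the code above, and it scans every entry rather than using binary search, since a binary search steered by exact comparison isn't guaranteed to visit every suffix that falls within the mismatch threshold.

```ruby
# Hypothetical sketch, not the code from the author's repository.
class SuffixArray
  def initialize(text)
    @suffix_array = (0...text.length).map { |i|
      { suffix: text[i..-1], position: i }
    }.sort_by { |entry| entry[:suffix] }
  end

  def hamming_distance(a, b)
    a.chars.zip(b.chars).count { |ca, cb| ca != cb }
  end

  # Start positions of every full-length window within n_mismatches
  # substitutions of the_substring, in ascending order.
  def find_all_substrings(the_substring, n_mismatches)
    @suffix_array.select { |entry|
      prefix = entry[:suffix][0, the_substring.length]
      # Skip suffixes shorter than the substring.
      prefix.length == the_substring.length &&
        hamming_distance(the_substring, prefix) <= n_mismatches
    }.map { |entry| entry[:position] }.sort
  end
end

sa = SuffixArray.new("abrazadabra")
p sa.find_all_substrings("aca", 1) # => [3, 5]
p sa.find_all_substrings("aca", 2) # => [0, 1, 3, 5, 7, 8]
```

The linear scan gives up the logarithmic lookup that makes a suffix array attractive in the first place; searching each variant exactly, as described above, is one way to keep it.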

Finding a substring while allowing for mismatches with Ruby
