Question

I'm writing a script that removes the longest duplicated substring from two strings. There are two strings, a and b:

a = "Hello World: This is a test message"
b = "Good Bye: This is a test message"

Because both strings contain the duplicate ": This is a test message", it should be removed from both of them. I'm trying to achieve the following output:

"Hello World"
"Good Bye"

Another example:

a = "Zoo is awesome. Hello World: This is not a test message"
b = "Zoo is not awesome. Good Bye: This is a test message"

with the expected output:

"Zoo is awesome. Hello World: This is not"
"Zoo is not awesome. Good Bye: This is"

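I'm thinking of splitting the strings into arrays of substrings and then subtracting the two arrays to get the unique substrings. Please advise if there is a better way.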
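For illustration, here is a minimal sketch of the array-subtraction idea, assuming the strings are split on whitespace (the split rule is an assumption; none is specified above). Array#- removes every matching word wherever it occurs, so on the second example it does not produce the expected output:

```ruby
a = "Zoo is awesome. Hello World: This is not a test message"
b = "Zoo is not awesome. Good Bye: This is a test message"

# Array#- drops every element that also appears in the other array,
# regardless of its position in the string.
a_only = (a.split - b.split).join(" ")
b_only = (b.split - a.split).join(" ")

puts a_only
# => Hello World:
puts b_only
# => Good Bye:
```

The shared words "Zoo", "is", "not" and "awesome." disappear as well, which is why a position-aware approach is needed.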

Answer 1

First you have to find the longest common substring, then subtract it. To find the longest common substring, you need to know all the substrings:

def substrings(string)
  (0..string.length-1).flat_map do |i|
    (1..string.length-i).flat_map do |j|
      string[i,j]
    end
  end
end

This starts at index 0 and takes the substrings of length 1, 2, and so on up to the full remaining length, then moves on to index 1 and repeats.
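As a quick check of the enumeration order, here is the method applied to a three-character string. The input "abc" is just an illustrative value, and the method is repeated so the snippet runs on its own:

```ruby
def substrings(string)
  (0..string.length - 1).flat_map do |i|
    # For each start index i, take every length from 1 to the end of the string.
    (1..string.length - i).map do |j|
      string[i, j]
    end
  end
end

p substrings("abc")
# => ["a", "ab", "abc", "b", "bc", "c"]
```

A string of length n yields n(n+1)/2 substrings this way, so the list grows quadratically with the input length.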

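This returns them in a fairly arbitrary order, though sorting by length is easy. The next step is to find the longest substring that all? of the strings include: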

def lcs(*list)
  list = list.sort_by(&:length)
  subs = substrings(list.first).sort_by(&:length).reverse

  subs.find do |str|
    list.all? do |entry|
      entry.include?(str)
    end
  end
end

The shortest entry is picked here (first in sort order) because it necessarily contains the longest common substring.
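Since lcs takes a splat, it also works for more than two strings. A small usage sketch follows; the three input strings are made up for illustration, and both methods are repeated so the snippet is self-contained:

```ruby
def substrings(string)
  (0..string.length - 1).flat_map do |i|
    (1..string.length - i).map { |j| string[i, j] }
  end
end

def lcs(*list)
  list = list.sort_by(&:length)
  # Candidates come from the shortest string, longest candidates first.
  subs = substrings(list.first).sort_by(&:length).reverse

  subs.find do |str|
    list.all? { |entry| entry.include?(str) }
  end
end

p lcs("say hello there", "hello world", "oh hello")
# => "hello"
```

Here the candidates are generated from "oh hello", the shortest of the three inputs.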

That gives you the substring to remove, so you can apply it:

def uniqueify(*list)
  list_lcs = lcs(*list)

  list.map do |entry|
    entry.sub(list_lcs, '')
  end
end

Which then works:

a = "Hello World: This is a test message"
b = "Good Bye: This is a test message"

lcs(a,b)
# => ": This is a test message"

uniqueify(a,b)
# => ["Hello World", "Good Bye"]
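One edge case worth noting: when the strings share no characters at all, lcs returns nil and String#sub raises a TypeError on a nil pattern. Below is a hedged sketch of a guarded variant; the name safe_uniqueify is hypothetical and not part of the answer above:

```ruby
def substrings(string)
  (0..string.length - 1).flat_map do |i|
    (1..string.length - i).map { |j| string[i, j] }
  end
end

def lcs(*list)
  list = list.sort_by(&:length)
  subs = substrings(list.first).sort_by(&:length).reverse
  subs.find { |str| list.all? { |entry| entry.include?(str) } }
end

# Hypothetical variant of uniqueify: returns the inputs unchanged
# when there is no common substring to remove.
def safe_uniqueify(*list)
  common = lcs(*list)
  return list if common.nil?

  list.map { |entry| entry.sub(common, '') }
end

p safe_uniqueify("abc", "xyz")
# => ["abc", "xyz"]
```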

Answer 2

Google has a great library called diff_match_patch that does a very fast character-based comparison of two strings, and there is a Ruby gem for it!

require 'diff_match_patch'
longest = DiffMatchPatch.new.diff_main(a, b).    # find diffs
    select { |type, text| type == :equal }.      # select only equal pieces
    map(&:last).                                 # get just text
    max_by(&:length)                             # find the longest one
a[longest] = ''                                  # delete this piece from a
b[longest] = ''                                  #               and from b

puts a
# => Hello World
puts b
# => Good Bye

Answer 3

If the goal is just to remove the common characters at the end of the two strings, we can write:

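def remove_common_ending(str1, str2)
  return ["", ""] if str1 == str2
  n = [str1.size, str2.size].min
  return [str1, str2] if n.zero?
  # i is the offset from the end of the first mismatch (case-insensitive).
  # If one string is a suffix of the other, there is no mismatch, so use n + 1.
  i = (1..n).find { |k| str1[-k].downcase != str2[-k].downcase } || n + 1
  [str1[0..-i], str2[0..-i]]
end

str1 = "Hello World: This is a test message"
str2 = "Good Bye: This is a test message"

remove_common_ending(str1, str2)
  #=> ["Hello World", "Good Bye"]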

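Another possible interpretation is that the longest common substring is to be removed from both strings. Here is one way to do that. My approach is similar to @tadman's, except that I start with the longest possible length of a common substring and shorten it step by step until I find a substring that occurs in both strings. That way there is no need to look at shorter matching substrings.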

def longest_common_substring(str1, str2)
  return '' if str1.empty? || str2.empty?
  s1, s2 = str1.downcase, str2.downcase
  (s1, s2 = s2, s1) if s2.size < s1.size   # make s1 the shorter string
  sz1 = s1.size
  sz1.downto(1) do |len|
    puts "Checking #{sz1-len+1} substrings of length #{len}..."
    (0..sz1-len).each do |i|
      s = s1[i, len]
      return s if s2.include?(s)
    end
  end
  ''   # no common substring was found
end

I added the puts statement to show the calculations being performed. Note that I search the longer string (str1) for substrings of the shorter string (str2).

str1 = "Hello World: This is a test message"
str2 = "Good Bye: This is a test message"  

s = longest_common_substring(str1, str2)    
Checking 1 substrings of length 32...
Checking 2 substrings of length 31...
Checking 3 substrings of length 30...
Checking 4 substrings of length 29...
Checking 5 substrings of length 28...
Checking 6 substrings of length 27...
Checking 7 substrings of length 26...
Checking 8 substrings of length 25...
Checking 9 substrings of length 24...
  #=> ": This is a test message"

r = /#{Regexp.escape(s)}/i
  #=> /:\ this\ is\ a\ test\ message/i
str1.sub(r,'') #=> "Hello World"
str2.sub(r,'') #=> "Good Bye"

As can be seen, the number of substrings of the shorter string (str2) that were checked, up to and including the longest common substring, is (1+9)*9/2 #=> 45.
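To verify the count without reading the log, the same search can be instrumented with a counter. This is a sketch only; lcs_with_count is a hypothetical helper that mirrors the search order of longest_common_substring and returns the (downcased) match together with the number of candidates examined:

```ruby
# Hypothetical helper: same search order as longest_common_substring,
# but counts every candidate substring that is tested.
def lcs_with_count(str1, str2)
  s1, s2 = str1.downcase, str2.downcase
  (s1, s2 = s2, s1) if s2.size < s1.size   # s1 is now the shorter string
  checked = 0
  s1.size.downto(1) do |len|
    (0..s1.size - len).each do |i|
      checked += 1
      candidate = s1[i, len]
      return [candidate, checked] if s2.include?(candidate)
    end
  end
  ['', checked]
end

str1 = "Hello World: This is a test message"
str2 = "Good Bye: This is a test message"

p lcs_with_count(str1, str2)
# => [": this is a test message", 45]
```

The match is the last of the nine candidates of length 24, so all candidates of lengths 32 down to 24 are tested.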

Answer 4

Following the idea of using arrays to remove only the matching substring, I came up with the following solution:

a = "Hello World: This is a test message"
b = "Good Bye: This is a test message"
# a = "Zoo is awesome. Hello World: This is not a test message"
# b = "Zoo is not awesome. Good Bye: This is a test message"

a_ary = a.split(/\b/)
b_ary = b.split(/\b/)

zipped = a_ary.reverse.zip(b_ary.reverse)
dropped = zipped.drop_while { |(a,b)| a == b }

dropped.reverse.transpose.map{|w| w.join('')}
#=> ["Hello World", "Good Bye"]
#=> ["Zoo is awesome. Hello World: This is not", "Zoo is not awesome. Good Bye: This is"]

As a one-liner:

a.split(/\b/).reverse.zip(b.split(/\b/).reverse).drop_while { |(a,b)| a == b }.reverse.transpose.map{|w| w.join('')}
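One caveat: Array#zip follows the length of its receiver, so if b splits into more tokens than a, the extra leading tokens of b are silently dropped. Both examples in the question happen to split into the same number of tokens. A sketch that avoids zip by counting the matching trailing tokens instead (remove_common_word_suffix is a hypothetical name, not part of the answer above):

```ruby
# Hypothetical helper: removes the common trailing tokens from both strings,
# regardless of how many tokens each string has.
def remove_common_word_suffix(a, b)
  a_tokens = a.split(/\b/)
  b_tokens = b.split(/\b/)
  limit = [a_tokens.size, b_tokens.size].min

  # Count how many tokens match when comparing from the end.
  common = (1..limit).take_while { |k| a_tokens[-k] == b_tokens[-k] }.size

  [a_tokens, b_tokens].map { |tokens| tokens[0, tokens.size - common].join }
end

p remove_common_word_suffix("Three: shared end", "One two: shared end")
# => ["Three", "One two"]
```

With the zip-based version, the same input would lose the leading "One " from the second string.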
