Removing a duplicate substring from two strings

Time: 2018-12-26 01:39:12

Tags: ruby

I'm writing a script to remove the longest duplicate substring from two strings. There are two strings, a and b:

a = "Hello World: This is a test message"
b = "Good Bye: This is a test message"

Since the substring ": This is a test message" appears in both, it should be removed from both strings. I'm trying to achieve the following output:

"Hello World"
"Good Bye"

Another example:

a = "Zoo is awesome. Hello World: This is not a test message"
b = "Zoo is not awesome. Good Bye: This is a test message"

with the expected output:

"Zoo is awesome. Hello World: This is not"
"Zoo is not awesome. Good Bye: This is"

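I'm thinking of splitting the strings into arrays of substrings and then subtracting the two arrays to get the unique substrings. Please let me know if there is a better approach.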

4 answers:

Answer 0 (score: 3)

First you have to find the longest common substring, then subtract it. To find the longest common substring, you need to know all the substrings:

def substrings(string)
  (0..string.length-1).flat_map do |i|
    (1..string.length-i).flat_map do |j|
      string[i,j]
    end
  end
end

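This works by starting at index 0 and taking substrings of length 1, 2, and so on up to the full remaining length, then moving on to index 1 and repeating until the end of the string.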
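To make the enumeration order concrete, here is a small run on a three-character input. The method is repeated so the snippet runs on its own, and the input "abc" is only an illustrative choice:

```ruby
# Same idea as the substrings method above, repeated so this runs standalone.
def substrings(string)
  (0..string.length - 1).flat_map do |i|
    # For each start index i, take every length from 1 up to the end.
    (1..string.length - i).map do |j|
      string[i, j]
    end
  end
end

p substrings("abc")
# => ["a", "ab", "abc", "b", "bc", "c"]
```

A string of length n has n(n+1)/2 substrings, so this grows quadratically with the input length.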

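This returns them in a fairly arbitrary order, though sorting them by length is trivial. The next step is to find the longest one that appears in every entry, using all?: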

def lcs(*list)
  list = list.sort_by(&:length)
  subs = substrings(list.first).sort_by(&:length).reverse

  subs.find do |str|
    list.all? do |entry|
      entry.include?(str)
    end
  end
end

The shortest entry is used here (first in the sorted order), since it necessarily contains the longest common substring.

That gives you the substring to remove, so you can apply it:

def uniqueify(*list)
  list_lcs = lcs(*list)

  list.map do |entry|
    entry.sub(list_lcs, '')
  end
end

Which then works:

a = "Hello World: This is a test message"
b = "Good Bye: This is a test message"

lcs(a,b)
# => ": This is a test message"

uniqueify(a,b)
# => ["Hello World", "Good Bye"]
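As a further check, the same three methods can be run on the second pair of strings from the question. This is only a sketch: the methods are repeated so the snippet is self-contained, and it assumes the pair has a single longest common substring (here " a test message").

```ruby
# The three methods from this answer, repeated so the snippet runs standalone.
def substrings(string)
  (0..string.length - 1).flat_map do |i|
    (1..string.length - i).map { |j| string[i, j] }
  end
end

def lcs(*list)
  list = list.sort_by(&:length)
  # Longest candidates first, so find returns the longest common one.
  subs = substrings(list.first).sort_by(&:length).reverse
  subs.find { |str| list.all? { |entry| entry.include?(str) } }
end

def uniqueify(*list)
  list_lcs = lcs(*list)
  list.map { |entry| entry.sub(list_lcs, '') }
end

a = "Zoo is awesome. Hello World: This is not a test message"
b = "Zoo is not awesome. Good Bye: This is a test message"

p lcs(a, b)
# => " a test message"
p uniqueify(a, b)
# => ["Zoo is awesome. Hello World: This is not", "Zoo is not awesome. Good Bye: This is"]
```

Note that lcs returns nil when the strings share no characters at all, so uniqueify would need a guard before calling sub in that case.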

Answer 1 (score: 1)

Google has a great library called diff_match_patch that performs a very fast character-based comparison of two strings, and there is a Ruby gem for it!

require 'diff_match_patch'
longest = DiffMatchPatch.new.diff_main(a, b).    # find diffs
    select { |type, text| type == :equal }.      # select only equal pieces
    map(&:last).                                 # get just text
    max_by(&:length)                             # find the longest one
a[longest] = ''                                  # delete this piece from a
b[longest] = ''                                  #               and from b

puts a
# => Hello World
puts b
# => Good Bye

Answer 2 (score: 0)

If the goal is simply to remove the common characters at the end of the two strings, we can write:

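def remove_common_ending(str1, str2)
  return ["", ""] if str1 == str2
  n = [str1.size, str2.size].min
  return [str1, str2] if n.zero?
  # Offset from the end of the first case-insensitive mismatch. If one string
  # is a suffix of the other there is no mismatch, so fall back to n + 1.
  i = (1..n).find { |k| str1[-k].downcase != str2[-k].downcase } || n + 1
  [str1[0..-i], str2[0..-i]]
end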

remove_common_ending(str1, str2)
  #=> ["Hello World", "Good Bye"] 

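Another possible interpretation is that the longest common substring is to be removed from both strings. Here is one way to do that. My approach is similar to @tadman's, except that I start with the length of the longest possible common substring and gradually reduce that length until I find a substring that appears in both strings. That way there is no need to examine shorter matching substrings.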

def longest_common_substring(str1, str2)
  return '' if str1.empty? || str2.empty?
  s1, s2 = str1.downcase, str2.downcase
  (s1, s2 = s2, s1) if s2.size < s1.size   # make s1 the shorter string
  sz1 = s1.size
  sz1.downto(1) do |len|
    puts "Checking #{sz1-len+1} substrings of length #{len}..."
    (0..sz1-len).each do |i|
      s = s1[i, len]
      return s if s2.include?(s)
    end
  end
  ''   # the strings have no character in common
end

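I added the puts statement to show the calculations being performed. Note that I search the longer string (str1) for substrings of the shorter string (str2). Because the comparison is done on downcased copies, the substring returned is lowercase, which is why the regular expression further down is case-insensitive.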

str1 = "Hello World: This is a test message"
str2 = "Good Bye: This is a test message"  

s = longest_common_substring(str1, str2)    
Checking 1 substrings of length 32...
Checking 2 substrings of length 31...
Checking 3 substrings of length 30...
Checking 4 substrings of length 29...
Checking 5 substrings of length 28...
Checking 6 substrings of length 27...
Checking 7 substrings of length 26...
Checking 8 substrings of length 25...
Checking 9 substrings of length 24...
  #=> ": this is a test message"

r = /#{Regexp.escape(s)}/i
  #=> /:\ this\ is\ a\ test\ message/i
str1.sub(r,'') #=> "Hello World"
str2.sub(r,'') #=> "Good Bye"

As can be seen, the number of substrings of the shorter string (str2) that were checked before the longest common substring was found is (1+9)*9/2 #=> 45
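The tally can be double-checked by counting the candidates directly. The helper below is a hypothetical variation of longest_common_substring (it is not part of the answer) that returns how many substrings were examined instead of the match itself:

```ruby
# Hypothetical helper: counts the candidate substrings examined before the
# first (and therefore longest) common substring is found.
def candidates_checked(str1, str2)
  # s1 is the shorter string, s2 the longer one; compare case-insensitively.
  s1, s2 = [str1.downcase, str2.downcase].sort_by(&:size)
  count = 0
  s1.size.downto(1) do |len|
    (0..s1.size - len).each do |i|
      count += 1
      return count if s2.include?(s1[i, len])
    end
  end
  count
end

p candidates_checked("Hello World: This is a test message",
                     "Good Bye: This is a test message")
# => 45
```

This agrees with the printed trace: 1 + 2 + ... + 9 candidates across lengths 32 down to 24, with the match being the last candidate of length 24.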

Answer 3 (score: 0)

Following the idea of using arrays to remove only the matching substrings, I came up with the following solution:

a = "Hello World: This is a test message"
b = "Good Bye: This is a test message"
# a = "Zoo is awesome. Hello World: This is not a test message"
# b = "Zoo is not awesome. Good Bye: This is a test message"

a_ary = a.split(/\b/)
b_ary = b.split(/\b/)

zipped = a_ary.reverse.zip(b_ary.reverse)
dropped = zipped.drop_while { |(a,b)| a == b }

dropped.reverse.transpose.map{|w| w.join('')}
#=> ["Hello World", "Good Bye"]
# With the commented-out pair of strings instead:
#=> ["Zoo is awesome. Hello World: This is not", "Zoo is not awesome. Good Bye: This is"]

As a one-liner:

a.split(/\b/).reverse.zip(b.split(/\b/).reverse).drop_while { |(a,b)| a == b }.reverse.transpose.map{|w| w.join('')}
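One caveat with the zip-based version: zip stops at the length of the first array, so if a has fewer tokens than b, the leading tokens of b are lost. A possible variation is sketched below. The name strip_common_tail is hypothetical and not from the answer. It counts the shared trailing tokens first and then trims each array independently:

```ruby
# Hypothetical helper: removes the trailing tokens that both strings share,
# regardless of which string has more tokens.
def strip_common_tail(a, b)
  a_tokens = a.split(/\b/)
  b_tokens = b.split(/\b/)
  # Walk both token lists from the end and count the identical pairs.
  common = a_tokens.reverse.zip(b_tokens.reverse).take_while { |x, y| x == y }.size
  # Keep everything except the shared tail in each list.
  [a_tokens, b_tokens].map { |tokens| tokens[0, tokens.size - common].join }
end

p strip_common_tail("Hello World: This is a test message",
                    "Good Bye: This is a test message")
# => ["Hello World", "Good Bye"]
p strip_common_tail("Four: same tail", "One two three: same tail")
# => ["Four", "One two three"]
```

Like the original, this only removes a shared ending and compares whole tokens, so it does not find a common substring in the middle of the strings.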