I'm writing a script that removes the longest repeated substring from two strings. Given two strings, a and b:
a = "Hello World: This is a test message"
b = "Good Bye: This is a test message"
Since both contain the duplicate ": This is a test message", it should be removed from both strings. I'm trying to get the following output:
"Hello World"
"Good Bye"
Another example:
a = "Zoo is awesome. Hello World: This is not a test message"
b = "Zoo is not awesome. Good Bye: This is a test message"
with the expected output:
"Zoo is awesome. Hello World: This is not"
"Zoo is not awesome. Good Bye: This is"
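I'm thinking of splitting the strings into arrays of substrings and then subtracting one array from the other to get the unique substrings. Please advise if there is a better approach.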
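Answer 0 (score: 3)
First you have to find the longest common substring, then subtract it. To find the longest common substring, you need to know all the substrings: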
def substrings(string)
  (0..string.length-1).flat_map do |i|
    (1..string.length-i).flat_map do |j|
      string[i,j]
    end
  end
end
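This works by starting at index 0 and taking substrings of length 1, 2, and so on up to the full length, then moving to index 1 and repeating until the end of the string is reached.
The substrings come back in a fairly arbitrary order, though sorting them by length is simple. The next step is to find the longest one that every string includes, which is what all? checks: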
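To make the enumeration order concrete, here is a small self-contained run on a three-character string. The method is the same as above, repeated so the snippet runs on its own; the input "abc" is only an illustration and does not come from the question.

```ruby
def substrings(string)
  (0..string.length - 1).flat_map do |i|
    (1..string.length - i).flat_map do |j|
      string[i, j] # substring starting at index i with length j
    end
  end
end

p substrings("abc")
# => ["a", "ab", "abc", "b", "bc", "c"]
```

A string of length n has n(n+1)/2 non-empty substrings, so the list grows quadratically with the input length.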
def lcs(*list)
  list = list.sort_by(&:length)
  subs = substrings(list.first).sort_by(&:length).reverse
  subs.find do |str|
    list.all? do |entry|
      entry.include?(str)
    end
  end
end
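Here the shortest entry (first in the sorted order) is chosen because it necessarily contains the longest common substring.
This gives you the substring to remove, so you can apply it: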
def uniqueify(*list)
  list_lcs = lcs(*list)
  list.map do |entry|
    entry.sub(list_lcs, '')
  end
end
Which then works:
a = "Hello World: This is a test message"
b = "Good Bye: This is a test message"
lcs(a,b)
# => ": This is a test message"
uniqueify(a,b)
# => ["Hello World", "Good Bye"]
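The brute-force search above builds every substring of the shortest string, which can get slow for long inputs. As a hedged alternative (not part of this answer), the sketch below uses the standard dynamic-programming table for the longest common substring of two strings, keeping only two rows at a time. The method name lcs_dp is hypothetical, and it handles exactly two strings rather than an arbitrary list.

```ruby
# Longest common substring of two strings via dynamic programming.
# curr[j] holds the length of the common run ending at a[i - 1] and b[j - 1].
def lcs_dp(a, b)
  best_len = 0
  best_end = 0 # index in `a` one past the end of the best match
  prev = Array.new(b.length + 1, 0)
  a.each_char.with_index(1) do |ca, i|
    curr = Array.new(b.length + 1, 0)
    b.each_char.with_index(1) do |cb, j|
      next unless ca == cb
      curr[j] = prev[j - 1] + 1
      if curr[j] > best_len
        best_len = curr[j]
        best_end = i
      end
    end
    prev = curr
  end
  a[best_end - best_len, best_len]
end

a = "Zoo is awesome. Hello World: This is not a test message"
b = "Zoo is not awesome. Good Bye: This is a test message"
common = lcs_dp(a, b)
p common
# => " a test message"
p [a, b].map { |s| s.sub(common, '') }
# => ["Zoo is awesome. Hello World: This is not", "Zoo is not awesome. Good Bye: This is"]
```

The running time is proportional to the product of the two string lengths. When several common substrings share the maximum length, this sketch returns the one that ends first in a.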
Answer 1 (score: 1)
Google has a great library called diff_match_patch that performs a very fast character-based comparison of two strings, and there is a Ruby gem for it!
require 'diff_match_patch'
longest = DiffMatchPatch.new.diff_main(a, b).        # find diffs
            select { |type, text| type == :equal }.  # select only equal pieces
            map(&:last).                             # get just text
            max_by(&:length)                         # find the longest one
a[longest] = '' # delete this piece from a
b[longest] = '' # and from b
puts a
# => Hello World
puts b
# => Good Bye
Answer 2 (score: 0)
If the goal is only to remove the common characters at the end of the two strings, we can write:
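def remove_common_ending(str1, str2)
  return ["", ""] if str1 == str2
  n = [str1.size, str2.size].min
  return [str1, str2] if n.zero?
  # First position from the end where the strings differ (case-insensitive);
  # n + 1 when the shorter string is entirely a suffix of the longer one.
  i = (1..n).find { |k| str1[-k].downcase != str2[-k].downcase } || n + 1
  [str1[0..-i], str2[0..-i]]
end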
remove_common_ending(str1, str2)
#=> ["Hello World", "Good Bye"]
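Another possible interpretation is that the longest common substring is to be removed from both strings. What follows is one way to do that. My approach is similar to @tadman's, except that I start with the length of the longest possible common substring and gradually reduce that length until I find a substring that appears in both strings. That way there is no need to look at shorter matching substrings.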
def longest_common_substring(str1, str2)
  return '' if str1.empty? || str2.empty?
  s1, s2 = str1.downcase, str2.downcase
  (s1, s2 = s2, s1) if s2.size < s1.size  # make s1 the shorter string
  sz1 = s1.size
  sz1.downto(1) do |len|
    puts "Checking #{sz1-len+1} substrings of length #{len}..."
    (0..sz1-len).each do |i|
      s = s1[i, len]
      return s if s2.include?(s)
    end
  end
  ''  # no common substring (downto would otherwise return sz1)
end
I added the puts statement to show the computations being performed. Note that I search the longer string (str1) for substrings of the shorter string (str2).
str1 = "Hello World: This is a test message"
str2 = "Good Bye: This is a test message"
s = longest_common_substring(str1, str2)
Checking 1 substrings of length 32...
Checking 2 substrings of length 31...
Checking 3 substrings of length 30...
Checking 4 substrings of length 29...
Checking 5 substrings of length 28...
Checking 6 substrings of length 27...
Checking 7 substrings of length 26...
Checking 8 substrings of length 25...
Checking 9 substrings of length 24...
#=> ": this is a test message"
r = /#{Regexp.escape(s)}/i
#=> /:\ this\ is\ a\ test\ message/i
str1.sub(r,'') #=> "Hello World"
str2.sub(r,'') #=> "Good Bye"
As can be seen, the number of substrings of the shorter string (str2) that were checked before the longest common substring was found is (1+9)*9/2 #=> 45.
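To double-check a count like this without reading the trace by hand, one option is a variant that returns the number of candidates examined alongside the match. This is only a sketch: count_candidates is a hypothetical helper, not part of the answer, and it mirrors the search order used above (longest candidates first, taken from the shorter string).

```ruby
# Returns [match, number_of_candidate_substrings_checked].
def count_candidates(str1, str2)
  # s1 is the shorter string, s2 the longer one (both downcased).
  s1, s2 = [str1.downcase, str2.downcase].sort_by(&:size)
  checked = 0
  s1.size.downto(1) do |len|
    (0..s1.size - len).each do |i|
      checked += 1
      candidate = s1[i, len]
      return [candidate, checked] if s2.include?(candidate)
    end
  end
  ['', checked] # nothing in common
end

match, checked = count_candidates("Hello World: This is a test message",
                                  "Good Bye: This is a test message")
puts "#{match.inspect} found after #{checked} checks"
```

Because the search stops at the first hit, the count depends on where the match sits inside the shorter string as well as on its length.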
Answer 3 (score: 0)
Thinking about using arrays to remove only the matching trailing substrings, I came up with the following solution:
a = "Hello World: This is a test message"
b = "Good Bye: This is a test message"
# a = "Zoo is awesome. Hello World: This is not a test message"
# b = "Zoo is not awesome. Good Bye: This is a test message"
a_ary = a.split(/\b/)
b_ary = b.split(/\b/)
zipped = a_ary.reverse.zip(b_ary.reverse)
dropped = zipped.drop_while { |(a,b)| a == b }
dropped.reverse.transpose.map{|w| w.join('')}
#=> ["Hello World", "Good Bye"]
#=> ["Zoo is awesome. Hello World: This is not", "Zoo is not awesome. Good Bye: This is"]
As a one-liner:
a.split(/\b/).reverse.zip(b.split(/\b/).reverse).drop_while { |(a,b)| a == b }.reverse.transpose.map{|w| w.join('')}
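One caveat, as far as I can tell: zip produces as many pairs as the first array has elements, so when b splits into more tokens than a, the leading tokens of b are silently dropped. A sketch that avoids this by counting the common trailing tokens and slicing each array separately (strip_common_tail is a hypothetical name, and the short test strings are made up for illustration):

```ruby
def strip_common_tail(a, b)
  a_tokens = a.split(/\b/)
  b_tokens = b.split(/\b/)
  # Number of identical tokens, counted from the end of both arrays.
  common = a_tokens.reverse.zip(b_tokens.reverse).take_while { |x, y| x == y }.size
  # Keep everything before the shared tail in each array.
  [a_tokens, b_tokens].map { |tokens| tokens[0, tokens.size - common].join }
end

p strip_common_tail("Hi: x y", "Good Bye now: x y")
# => ["Hi", "Good Bye now"]
```

For the two examples in the question, where both strings split into the same number of tokens, this gives the same results as the code above.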