Remove duplicated text from multiple strings

Time: 2013-08-24 10:45:38

Tags: ruby

I have:

a = "This is Product A with property B and propery C. Buy it now!"
b = "This is Product B with property X and propery Y. Buy it now!"
c = "This is Product C having no properties. Buy it now!"

I am looking for an algorithm that can do this:

> magic(a, b, c)
=> ['A with property B and propery C', 
    'B with property X and propery Y', 
    'C having no properties']

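I have to find the duplicated parts across more than 1000 texts. Top performance is not a must, but it would be nice.

- Update

I am looking for sequences of words. So if: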

d = 'This is Product D with text engraving: "Buy". Buy it now!' 

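The first "Buy" should not count as a duplicate. I guess I need a threshold of n consecutive words before a sequence is treated as a duplicate.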
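One way to read the threshold idea, as a minimal sketch rather than a definitive solution: compare whole words instead of characters, and only strip a shared run when it is at least n words long. The names `common_word_prefix_length`, `magic_by_words` and `min_words` are hypothetical, and the sketch only handles runs shared at the start and end of every text, not repeated runs in the middle.

```ruby
# Hypothetical helper: number of leading words shared by every word list.
def common_word_prefix_length(word_lists)
  shortest = word_lists.map(&:size).min || 0
  (0...shortest).each do |i|
    return i unless word_lists.all? { |words| words[i] == word_lists.first[i] }
  end
  shortest
end

# Hypothetical word-based variant of magic.
# min_words is the assumed threshold n: shorter shared runs are kept.
def magic_by_words(texts, min_words: 2)
  word_lists = texts.map(&:split)

  prefix = common_word_prefix_length(word_lists)
  prefix = 0 if prefix < min_words

  # Drop the shared prefix, then reverse so the suffix becomes a prefix.
  rests = word_lists.map { |words| words.drop(prefix).reverse }

  suffix = common_word_prefix_length(rests)
  suffix = 0 if suffix < min_words

  rests.map { |words| words.drop(suffix).reverse.join(' ') }
end
```

Because words are split on whitespace, punctuation stays attached to its word, so the sentence-final period is kept in each result. The single quoted "Buy" in string d survives because it never lines up with a shared run of words.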

3 answers:

Answer 0 (score: 3)

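# Length of the prefix shared by all given strings.
def common_prefix_length(*args)
  first = args.shift
  # find_index returns nil when no mismatch is found (all strings identical),
  # so fall back to the full length in that case.
  (0..first.size).find_index { |i| args.any? { |a| a[i] != first[i] } } || first.size
end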

def magic(*args)
  i = common_prefix_length(*args)
  args = args.map { |a| a[i..-1].reverse }
  i = common_prefix_length(*args)
  args.map { |a| a[i..-1].reverse }
end

a = "This is Product A with property B and propery C. Buy it now!"
b = "This is Product B with property X and propery Y. Buy it now!"
c = "This is Product C having no properties. Buy it now!"

magic(a,b,c)
# => ["A with property B and propery C",
#     "B with property X and propery Y",
#     "C having no properties"]

Answer 1 (score: 3)

Your data

sentences = [ 
  "This is Product A with property B and propery C. Buy it now!",
  "This is Product B with property X and propery Y. Buy it now!",
  "This is Product C having no properties. Buy it now!"
]

Your magic

def magic(data)
  prefix, postfix = 0, -1
  data.map{ |d| d[prefix] }.uniq.compact.size == 1 && prefix += 1 or break  while true
  data.map{ |d| d[postfix] }.uniq.compact.size == 1 && prefix > -postfix && postfix -= 1 or break  while true
  data.map{ |d| d[prefix..postfix] }
end

Your output

magic(sentences)
#=> [
#=>   "A with property B and propery C",
#=>   "B with property X and propery Y",
#=>   "C having no properties"
#=> ]

Or you can use loop instead of while true

def magic(data)
  prefix, postfix = 0, -1
  loop{ data.map{ |d| d[prefix] }.uniq.compact.size == 1 && prefix += 1 or break }
  loop{ data.map{ |d| d[postfix] }.uniq.compact.size == 1 && prefix > -postfix && postfix -= 1 or break }
  data.map{ |d| d[prefix..postfix] }
end

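Answer 2 (score: -1)

Edit: this code has a bug. I am leaving my answer up for reference, because I don't like it when people delete their answers after being downvoted. Everybody makes mistakes :-)

I like @falsetru's approach, but feel the code is unnecessarily complex. Here is my attempt: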

def common_prefix_length(strings)
  i = 0
  i += 1 while strings.map{|s| s[i] }.uniq.size == 1
  i
end

def common_suffix_length(strings)
  common_prefix_length(strings.map(&:reverse))
end

def uncommon_infixes(strings)
  pl = common_prefix_length(strings)
  sl = common_suffix_length(strings)
  strings.map{|s| s[pl...-sl] }
end
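The answer does not say which bug is meant. Two problems are visible in the code as written: when there is no common suffix, `sl` is 0 and `s[pl...-sl]` becomes `s[pl...0]`, which discards the rest of the string; and when all strings are identical, `s[i]` is `nil` for every string past the end, so the `while` loop never terminates. The following is a minimal sketch of one possible repair, not the author's fix; the `bounded_` names are hypothetical.

```ruby
# Hypothetical bounded variant: never scans past the shortest string.
def bounded_common_prefix_length(strings)
  shortest = strings.map(&:size).min || 0
  i = 0
  i += 1 while i < shortest && strings.map { |s| s[i] }.uniq.size == 1
  i
end

def bounded_uncommon_infixes(strings)
  pl = bounded_common_prefix_length(strings)
  # Strip the prefix first so the suffix scan cannot overlap it.
  rests = strings.map { |s| s[pl..-1].reverse }
  sl = bounded_common_prefix_length(rests)
  # Slice from the front of the reversed string, which avoids the -0 index.
  rests.map { |s| s[sl..-1].reverse }
end
```

Measuring the suffix on the already trimmed strings follows the same idea as answer 0, and it keeps the prefix and suffix from claiming the same characters.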

Since the OP may care about performance, I ran a quick benchmark:

require 'fruity'
require 'securerandom'

prefix = 'PREFIX '
suffix = ' SUFFIX'
test_data = Array.new(1000) do
  prefix + SecureRandom.hex + suffix
end

def fl00r_meth(data)
  prefix, postfix = 0, -1
  data.map{ |d| d[prefix] }.uniq.size == 1 && prefix += 1 or break  while true
  data.map{ |d| d[postfix] }.uniq.size == 1 && postfix -= 1 or break  while true
  data.map{ |d| d[prefix..postfix] }
end

def falsetru_common_prefix_length(*args)
  first = args.shift
  (0..first.size).find_index { |i| args.any? { |a| a[i] != first[i] } }
end

def falsetru_meth(*args)
  i = falsetru_common_prefix_length(*args)
  args = args.map { |a| a[i..-1].reverse }
  i = falsetru_common_prefix_length(*args)
  args.map { |a| a[i..-1].reverse }
end

def padde_common_prefix_length(strings)
  i = 0
  i += 1 while strings.map{|s| s[i] }.uniq.size == 1
  i
end

def padde_common_suffix_length(strings)
  padde_common_prefix_length(strings.map(&:reverse))
end

def padde_meth(strings)
  pl = padde_common_prefix_length(strings)
  sl = padde_common_suffix_length(strings)
  strings.map{|s| s[pl...-sl] }
end

compare do
  fl00r do
    fl00r_meth(test_data.dup)
  end

  falsetru do
    falsetru_meth(*test_data.dup)
  end

  padde do
    padde_meth(test_data.dup)
  end
end

The results:

Running each test once. Test will take about 1 second.
fl00r is similar to padde
padde is faster than falsetru by 30.000000000000004% ± 10.0%