I have:
a = "This is Product A with property B and propery C. Buy it now!"
b = "This is Product B with property X and propery Y. Buy it now!"
c = "This is Product C having no properties. Buy it now!"
I am looking for an algorithm that can do this:
> magic(a, b, c)
=> ['A with property B and propery C',
'B with property X and propery Y',
'C having no properties']
I have to find the duplicated parts in 1000+ texts. Top performance isn't a must, but it would be nice.
Update:
I am looking for sequences of words. So if:
d = 'This is Product D with text engraving: "Buy". Buy it now!'
The first "Buy" should not count as a duplicate. I guess I need a threshold of n consecutive words before a match is treated as a duplicate.
Answer 0 (score: 3)
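# Index of the first character position at which the strings differ.
def common_prefix_length(*args)
  first = args.shift
  (0..first.size).find_index { |i| args.any? { |a| a[i] != first[i] } }
end

def magic(*args)
  # Drop the common prefix, then reverse so the common suffix becomes a prefix.
  i = common_prefix_length(*args)
  args = args.map { |a| a[i..-1].reverse }
  # Drop the (reversed) common suffix and restore the original order.
  i = common_prefix_length(*args)
  args.map { |a| a[i..-1].reverse }
end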
a = "This is Product A with property B and propery C. Buy it now!"
b = "This is Product B with property X and propery Y. Buy it now!"
c = "This is Product C having no properties. Buy it now!"
magic(a,b,c)
# => ["A with property B and propery C",
# "B with property X and propery Y",
# "C having no properties"]
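The code above compares character by character, so product names that share leading characters (say "AB" and "AC") would lose the shared part, while the question's update asks for runs of words. Below is a minimal word-level sketch, assuming the texts can be tokenized on whitespace. The names shared_prefix_size and magic_words are hypothetical and not part of the answer. Punctuation attached to a word stays with that word, so the results keep the trailing period, and runs of whitespace are collapsed to single spaces.

```ruby
# Number of leading elements shared by every array in lists.
def shared_prefix_size(lists)
  shortest = lists.map(&:size).min
  (0...shortest).each do |i|
    return i unless lists.all? { |list| list[i] == lists.first[i] }
  end
  shortest
end

# Hypothetical helper: strips the leading and trailing words common to all texts.
def magic_words(*texts)
  words = texts.map(&:split)
  pre = shared_prefix_size(words)
  rest = words.map { |w| w.drop(pre) }
  # The common suffix is the common prefix of the reversed remainders.
  suf = shared_prefix_size(rest.map(&:reverse))
  rest.map { |w| w.take(w.size - suf).join(" ") }
end

a = "This is Product A with property B and propery C. Buy it now!"
b = "This is Product B with property X and propery Y. Buy it now!"
c = "This is Product C having no properties. Buy it now!"
p magic_words(a, b, c)
```

Because whole words are compared, the quoted "Buy". in the question's string d is a different token from the closing Buy and is not stripped.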
Answer 1 (score: 3)
Your data:
sentences = [
  "This is Product A with property B and propery C. Buy it now!",
  "This is Product B with property X and propery Y. Buy it now!",
  "This is Product C having no properties. Buy it now!"
]
Your magic:
def magic(data)
  prefix, postfix = 0, -1
  data.map{ |d| d[prefix] }.uniq.compact.size == 1 && prefix += 1 or break while true
  data.map{ |d| d[postfix] }.uniq.compact.size == 1 && prefix > -postfix && postfix -= 1 or break while true
  data.map{ |d| d[prefix..postfix] }
end
Your output:
magic(sentences)
#=> [
#=> "A with property B and propery C",
#=> "B with property X and propery Y",
#=> "C having no properties"
#=> ]
Or you can use loop instead of while true:
def magic(data)
  prefix, postfix = 0, -1
  loop{ data.map{ |d| d[prefix] }.uniq.compact.size == 1 && prefix += 1 or break }
  loop{ data.map{ |d| d[postfix] }.uniq.compact.size == 1 && prefix > -postfix && postfix -= 1 or break }
  data.map{ |d| d[prefix..postfix] }
end
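Answer 2 (score: -1)
EDIT: This code has a bug. I am leaving my answer up for reference, because I don't like it when people delete their answers after being downvoted. Everyone makes mistakes :-)
I like @falsetru's approach, but found the code unnecessarily complicated. Here is my attempt: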
def common_prefix_length(strings)
  i = 0
  i += 1 while strings.map{|s| s[i] }.uniq.size == 1
  i
end

def common_suffix_length(strings)
  common_prefix_length(strings.map(&:reverse))
end

def uncommon_infixes(strings)
  pl = common_prefix_length(strings)
  sl = common_suffix_length(strings)
  strings.map{|s| s[pl...-sl] }
end
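The answer does not say which bug it means. Two candidates can be read off the code: the while loop never stops when all strings are identical (past the end every s[i] is nil, so uniq.size stays 1), and s[pl...-sl] returns an empty string when there is no common suffix, because sl is 0 and the range becomes pl...0. The following is a hedged sketch of one possible repair, not the author's fix. The names bounded_prefix_length and uncommon_infixes_fixed are hypothetical.

```ruby
# Common prefix length, bounded by the shortest string so the loop always ends.
def bounded_prefix_length(strings)
  shortest = strings.map(&:size).min
  i = 0
  i += 1 while i < shortest && strings.map { |s| s[i] }.uniq.size == 1
  i
end

# Hypothetical repaired variant of uncommon_infixes.
def uncommon_infixes_fixed(strings)
  pl = bounded_prefix_length(strings)
  # Strip the prefix first so prefix and suffix can never overlap.
  tails = strings.map { |s| s[pl..-1] }
  sl = bounded_prefix_length(tails.map(&:reverse))
  # Slice by explicit length; this also works when sl is 0.
  tails.map { |s| s[0, s.size - sl] }
end

p uncommon_infixes_fixed(["PREFIX x", "PREFIX y"])
```

Measuring the suffix on the already-trimmed strings mirrors what Answer 0 does with its reverse step, at the cost of one extra pass over the data.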
Since the OP may care about performance, I ran a quick benchmark:
require 'fruity'
require 'securerandom'

prefix = 'PREFIX '
suffix = ' SUFFIX'
test_data = Array.new(1000) do
  prefix + SecureRandom.hex + suffix
end

def fl00r_meth(data)
  prefix, postfix = 0, -1
  data.map{ |d| d[prefix] }.uniq.size == 1 && prefix += 1 or break while true
  data.map{ |d| d[postfix] }.uniq.size == 1 && postfix -= 1 or break while true
  data.map{ |d| d[prefix..postfix] }
end

def falsetru_common_prefix_length(*args)
  first = args.shift
  (0..first.size).find_index { |i| args.any? { |a| a[i] != first[i] } }
end

def falsetru_meth(*args)
  i = falsetru_common_prefix_length(*args)
  args = args.map { |a| a[i..-1].reverse }
  i = falsetru_common_prefix_length(*args)
  args.map { |a| a[i..-1].reverse }
end

def padde_common_prefix_length(strings)
  i = 0
  i += 1 while strings.map{|s| s[i] }.uniq.size == 1
  i
end

def padde_common_suffix_length(strings)
  padde_common_prefix_length(strings.map(&:reverse))
end

def padde_meth(strings)
  pl = padde_common_prefix_length(strings)
  sl = padde_common_suffix_length(strings)
  strings.map{|s| s[pl...-sl] }
end

compare do
  fl00r do
    fl00r_meth(test_data.dup)
  end
  falsetru do
    falsetru_meth(*test_data.dup)
  end
  padde do
    padde_meth(test_data.dup)
  end
end
These are the results:
Running each test once. Test will take about 1 second.
fl00r is similar to padde
padde is faster than falsetru by 30.000000000000004% ± 10.0%