Question

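So, let's say I have these texts:

Text 1:

absolute obedience to the zerg collective sentience known as the Overmind. The Overmind directed the actions of every zerg creature in the Swarm, functioning through a hierarchy of lesser sentients.

Text 2:

zerg creature in the Swarm, functioning through a hierarchy of lesser sentients. Although the Overmind was primarily driven by its desire to consume and assimilate

Text 3:

When the zerg first arrived in the Koprulu sector, they were unified by their absolute obedience to the zerg collective sentience known as the Overmind. The Overmind directed the actions of every zerg creature in the Swarm, functioning through a hierarchy of lesser sentients. Although the Overmind was primarily driven by its desire to consume and assimilate the advanced protoss race, it found useful but undeveloped material in humanity.

Now, the end of Text 1 and the beginning of Text 2 overlap, so we say the text blocks are not unique. Similarly, Text 1 (as well as Text 2) can be found inside Text 3, so because of that overlap those are not unique either.

So, my question:

How can I write something that looks at consecutive letters or words and determines uniqueness? Ideally, I would like such a method to return a value representing the amount of similarity, perhaps the number of words matched divided by the average size of the two text blocks. When it returns 0, the two texts tested should be completely unique.

I ran into a few problems while playing with Ruby's string methods.

First, I started by trying to find the intersection of the two strings.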

>> a = "nt version, there are no ch"  
>> b = "he current versi"  
>> (a.chars.to_a & b.chars.to_a).join  
=> "nt versihc"

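The problem with the approach above is that it just adds the letters in common to the end of the result (we lose the order of the characters), which would make it hard to test for uniqueness. But I don't think intersection is the best way to start this kind of similarity comparison anyway. Any number of word combinations could be shared between the two texts being compared. So maybe I could build an array of consecutive similarities... but that would require traversing one of the texts once for every phrase length we try.

I guess I really just don't know where to start, or how to do this in a way that is efficient rather than O(n^too_high).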

Answer 1

Here is a Ruby implementation of the Levenshtein distance algorithm. Once you have installed the gem, you can use it like this:

require 'rubygems'
require 'text'  # lowercase: require paths are case-sensitive on most file systems

t1 = "absolute obedience to the zerg collective sentience known as the Overmind. The Overmind directed the actions of every zerg creature in the Swarm, functioning through a hierarchy of lesser sentients."

t2 = "zerg creature in the Swarm, functioning through a hierarchy of lesser sentients. Although the Overmind was primarily driven by its desire to consume and assimilate"

puts Text::Levenshtein.distance(t1,t2)

Answer 2

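I believe what you are looking for is the Longest Common Substring problem, that is, the problem of finding the longest substring that two given strings have in common. The link goes to the Wikipedia page, which will help you understand the domain and contains a pseudocode example of an algorithm that runs in O(nm) time.

In addition, the Wikibooks Algorithm Implementation book has an implementation in Ruby. It includes an lcs_size method that may be all you need. In short, if lcs_size(text1, text2) returns 4, text1 and text2 have very little consecutive text in common, probably just a single word; if it returns, say, 40, they may share an entire sentence.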
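For illustration, here is a minimal sketch of the O(nm) dynamic-programming idea. It is not the Wikibooks code; the method name longest_common_substring_length is made up for this example, and it compares characters rather than words.

```ruby
# Hypothetical helper, not the Wikibooks lcs_size: returns the length of the
# longest run of consecutive characters that appears in both strings.
# Runs in O(n*m) time and keeps only two rows of the DP table.
def longest_common_substring_length(a, b)
  best = 0
  prev = Array.new(b.length + 1, 0)
  a.each_char do |ca|
    curr = Array.new(b.length + 1, 0)
    b.each_char.with_index(1) do |cb, j|
      next unless ca == cb
      # Extend the common run that ended just before these two characters.
      curr[j] = prev[j - 1] + 1
      best = curr[j] if curr[j] > best
    end
    prev = curr
  end
  best
end

# The two strings from the question share "nt versi".
puts longest_common_substring_length("nt version, there are no ch", "he current versi") # => 8
```

A result of 0 means the two strings have no character in common at all, which matches the "completely unique" case the question asks for.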

Hope this helps!

Answer 3

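Your problem is not Ruby, it is the algorithm. You can split each text into words and then run a minimum edit distance algorithm (http://en.wikipedia.org/wiki/Levenshtein_distance) on the two word lists.

The smaller the number, the more similar the texts.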
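As a rough sketch of that suggestion, the code below computes the Levenshtein distance over arrays of words instead of characters. The names words and edit_distance are invented for this example, and the tokenization (lowercasing and keeping only runs of word characters) is an assumption, not something the question specifies.

```ruby
# Hypothetical tokenizer: lowercase the text and keep runs of word characters.
def words(text)
  text.downcase.scan(/\w+/)
end

# Levenshtein distance between two arrays: the minimum number of
# insertions, deletions and substitutions that turns a into b.
def edit_distance(a, b)
  prev = (0..b.length).to_a          # distance from an empty prefix of a
  a.each_with_index do |x, i|
    curr = [i + 1]                   # distance to an empty prefix of b
    b.each_with_index do |y, j|
      cost = x == y ? 0 : 1
      curr << [prev[j + 1] + 1,      # delete x
               curr[j] + 1,          # insert y
               prev[j] + cost].min   # keep or substitute
    end
    prev = curr
  end
  prev.last
end

t1 = words("The Overmind directed the actions of every zerg creature")
t2 = words("The Overmind directed every zerg creature in the Swarm")
puts edit_distance(t1, t2)
```

Note that this distance is 0 for identical texts and grows as they diverge, so it is the inverse of the score the question describes; dividing by the longer word count would give a value between 0 and 1.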

Answer 4

This could be improved a lot, but here is an idea:

txt1 = "absolute obedience to the zerg collective sentience known as the Overmind. The Overmind directed the actions of every zerg creature in the Swarm, functioning through a hierarchy of lesser sentients."
txt2 = "zerg creature in the Swarm, functioning through a hierarchy of lesser sentients. Although the Overmind was primarily driven by its desire to consume and assimilate"

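# Split a text into lowercase words, treating periods and commas as spaces.
def txt_to_ary(txt)
  txt.gsub(/\.|,/, ' ').downcase.split(/\s+/)
end

# Length, in words, of the longest run of consecutive words shared by
# two word arrays.
def longest_match(txt1, txt2)
  longest = 0
  txt1.each_with_index do |w1, i|
    txt2.each_with_index do |w2, j|
      next unless w1 == w2
      k = 0
      # Stop at the end of txt1; otherwise nil == nil would loop forever
      # when both arrays run out at the same time.
      k += 1 while txt1[i + k] && txt1[i + k] == txt2[j + k]
      longest = k if k > longest
    end
  end
  longest
end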

txt1 = txt_to_ary( txt1 )
txt2 = txt_to_ary( txt2 )

puts longest_match(txt1, txt2) #=>12
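One possible way to turn such a count into the score the question asks for is to divide the longest shared run by the average word count of the two texts. The sketch below is self-contained; overlap_score, word_list and longest_shared_run are hypothetical names, and the normalization is only one interpretation of the question.

```ruby
# Hypothetical tokenizer: lowercase words, punctuation dropped.
def word_list(text)
  text.downcase.scan(/\w+/)
end

# Longest run of consecutive words present in both arrays (brute force).
def longest_shared_run(a, b)
  best = 0
  a.each_index do |i|
    b.each_index do |j|
      k = 0
      # Bounds checks keep the scan from running past either array.
      k += 1 while i + k < a.length && j + k < b.length && a[i + k] == b[j + k]
      best = k if k > best
    end
  end
  best
end

# 0.0 when no word is shared, 1.0 when the word sequences are identical.
def overlap_score(text1, text2)
  a = word_list(text1)
  b = word_list(text2)
  return 0.0 if a.empty? || b.empty?
  longest_shared_run(a, b) / ((a.length + b.length) / 2.0)
end

puts overlap_score("a b c d", "c d e f") # => 0.5
```

When one text is fully contained in a longer one, the score stays below 1.0 because the average length is larger than the shared run; dividing by the shorter length instead would report containment as 1.0.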

Answer 5

The amatch gem is great for string comparison.

Ruby: How can I test for similarity between two blocks of text?