I want to count how many words have changed between two strings. I put together a very simple brute-force approach:
before = "one two three four"
after = "one six three four five"
word_count_before = before.scan(/\w+/).size
word_count_after = after.scan(/\w+/).size
# If the string got smaller we still want a positive number when comparing the two.
if word_count_before > word_count_after
  bigger = word_count_before
  smaller = word_count_after
else
  bigger = word_count_after
  smaller = word_count_before
end
word_difference = bigger - smaller
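The problem is that it only counts how many words were added or removed. With my method, when 2 words have changed, the result says that only 1 word changed ('two' => 'six', and 'five' was added).
I couldn't find a way to get the number of words that changed in a string, but I have seen similar (more sophisticated) examples elsewhere. The Stack Overflow suggested-edit feature shows a before and after view with the words that were changed, replaced, or removed in a post. Likewise, when you commit to Bitbucket or git, you can see the changes to a file between commits. I only want to count the number of changed words, but these examples might help.
Is there a way to do this in Ruby or RoR?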
Answer 0 (score: 1)
before = "one two three four"
after = "one six three four five"
before, after = [before, after].map(&:split)
common = [before, after].reduce &:&
before_not_after = before - common
after_not_before = after - common
To handle repeated words, so that each word in after cancels only one matching occurrence in before, one might use:
before, after = [before, after].map(&:split)
# after execution of the line below, before array will contain result
after.each { |e| (i = before.index(e)) && before.delete_at(i) }
Note that the latter mutates the array before.
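If the mutation is unwanted, the same occurrence-by-occurrence removal can be sketched without touching the input arrays. The helper name `multiset_diff` is hypothetical, not part of the answer above, and this is only one possible way to do it:

```ruby
# Hypothetical helper: returns the words of `a` that are not matched
# one-for-one by an occurrence in `b`. Neither array is modified.
def multiset_diff(a, b)
  # Count how many times each word occurs in b.
  remaining = b.each_with_object(Hash.new(0)) { |word, counts| counts[word] += 1 }
  a.reject do |word|
    if remaining[word] > 0
      remaining[word] -= 1 # consume one matching occurrence
      true
    else
      false
    end
  end
end

before = "one two three four".split
after  = "one six three four five".split

removed = multiset_diff(before, after) # words present only in before
added   = multiset_diff(after, before) # words present only in after
```

Here `removed` holds the words that disappeared and `added` the words that are new, while `before` and `after` stay intact.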
Answer 1 (score: 0)
If your goal is to measure the difference between two texts, there are various algorithms for that. Take a look at Levenshtein distance, for example.
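As a minimal sketch of that idea, assuming the edit distance is computed over whole words instead of characters: the function name `word_edit_distance` is made up for illustration, and each inserted, deleted, or replaced word counts as one change.

```ruby
# Word-level Levenshtein distance: the minimum number of word insertions,
# deletions, and substitutions needed to turn `before` into `after`.
def word_edit_distance(before, after)
  a = before.split
  b = after.split
  # prev[j] is the distance between the words of a seen so far and the first j words of b.
  prev = (0..b.size).to_a
  a.each_with_index do |word_a, i|
    curr = [i + 1] # distance from the first i + 1 words of a to an empty text
    b.each_with_index do |word_b, j|
      cost = word_a == word_b ? 0 : 1
      curr << [
        prev[j + 1] + 1, # delete word_a
        curr[j] + 1,     # insert word_b
        prev[j] + cost   # keep or substitute
      ].min
    end
    prev = curr
  end
  prev.last
end

changed = word_edit_distance("one two three four", "one six three four five")
```

For the strings in the question this counts one substitution ('two' => 'six') plus one insertion ('five').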
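If what matters is whether the words themselves are the same, rather than where they sit in the text, I can offer this method, which I use to compare items in ebooks. It goes a bit further than your sample.
"one two three four" and "one four four four" would count as the same in your implementation, but not with this one.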
one = "een redelijk lange tekst om na te gaan of dit programma het verschil kan maken tussen soortgelijke teksten door rekening te houden met combinaties van woorden"
two = "een redelijk lange tekst met bijna dezelfde woorden als de vorige om na te gaan of dit programma het verschil kan maken tussen soortgelijke teksten door rekening te houden met combinaties van woorden"
three = "een totaal andere tekst, ik maak hem lang genoeg om representabel te zijn en zet er enkele woorden bij die in de eerste tekst ook voorkomen"
class String
  def similarities_with text, lookafter_count=2, lookbefore_count=2
    r = [self.split, text.split].each.inject([]) do |r, a|
      r << a.each_with_index.inject([]) do |m, (element, index)|
        m << a[index-lookbefore_count..index+lookafter_count]
      end
    end
    (r.first & r.last).reject(&:empty?).count
  end
end
one.similarities_with one # 24
one.similarities_with two # 20
one.similarities_with three # 0
"one two three four".similarities_with("one six three four five", 0, 0) # 3
"one two three four".similarities_with("one six three four five", 1, 1) # 0
# and now the difference
one.similarities_with(one) - one.similarities_with(two) # 4
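Some explanation: the method compares the string itself with the string passed as the argument. I use inject so I don't have to define the empty arrays up front. The result (r) holds, for each text, an array of word combinations: each word together with the words before and after it. The two arrays are then compared, and only the combinations that occur in both texts are counted and returned by the method.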
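To make the intermediate arrays visible, the windowing step can be pulled out into a standalone helper. The name `context_windows` is hypothetical; the slicing expression is the same one the method uses.

```ruby
# Hypothetical helper: for every word, take the slice that starts
# lookbefore_count words earlier and ends lookafter_count words later.
def context_windows(text, lookbefore_count = 2, lookafter_count = 2)
  words = text.split
  words.each_index.map do |index|
    # A negative start index counts from the end of the array, so the
    # first windows usually come out empty; that is why the method
    # above rejects empty arrays before counting.
    words[(index - lookbefore_count)..(index + lookafter_count)]
  end
end

first  = context_windows("one two three four", 1, 1)
second = context_windows("one six three four five", 1, 1)
shared = (first & second).reject(&:empty?)
```

With one word of context on each side, no window of the first text also occurs in the second, which matches the result of 0 in the example above. With no context at all (0, 0) each window is a single word, so the count is simply the number of distinct shared words.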
Answer 2 (score: 0)
before = %w"one two three four" # => ["one", "two", "three", "four"]
after = %w"one six three four five" # => ["one", "six", "three", "four", "five"]
after - before # => ["six", "five"] These words were added
before - after # => ["two"] These words were removed