Question

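Given a text of 1,000 words, what is an efficient way to check it against a dictionary of 10,000 words? I want to count the number of non-unique matches, so repeated words count each time they occur.

One idea is to store the dictionary as a hash. But then I would have to check each word against the hash, which is 1,000 operations. That doesn't seem efficient.

Another idea is Postgres text search. But is it possible to do this check in a single query?

Another idea is to store the words in a Memcache or Redis database, but that would require 1,000 queries and be slow.

So, is there a more efficient solution?

I'm using Ruby.

Edit: adding benchmarks.

Cary is right that dict_set is faster: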

aw.length
=> 250
dw.length
=> 1233
dict_set.length
=> 1223
t = Time.now; 1000.times{ aw & dw }; Time.now - t
=> 0.682465
t = Time.now; 1000.times{ aw.count{ |w| dict_set.include? w }}; Time.now - t
=> 0.063375

So Set#include? appears to be very efficient.
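For anyone wanting to repeat the comparison, here is a minimal sketch using Ruby's standard Benchmark module. The data is made up (the original aw and dw are not shown), and the helper names sample_data, unique_matches and non_unique_matches are hypothetical. Note that Array#& returns only the distinct common elements, so the two expressions timed above do not compute the same number when the text contains repeated words.

```ruby
require 'set'
require 'benchmark'

# Hypothetical stand-ins for aw (article words) and dw (dictionary words).
# Every word in aw appears exactly twice, so the two counts differ.
def sample_data
  dw = (1..1233).map { |i| "word#{i}" }
  aw = (1..250).map { |i| "word#{(i % 125 + 1) * 10}" }
  [aw, dw]
end

# Array intersection: counts each matching word once.
def unique_matches(aw, dw)
  (aw & dw).length
end

# Set lookup per word: counts every occurrence.
def non_unique_matches(aw, dict_set)
  aw.count { |w| dict_set.include?(w) }
end

aw, dw = sample_data
dict_set = dw.to_set

Benchmark.bm(14) do |x|
  x.report('Array#&')      { 1000.times { unique_matches(aw, dw) } }
  x.report('Set#include?') { 1000.times { non_unique_matches(aw, dict_set) } }
end
```

Timings will vary by machine and Ruby version, so treat the output as a relative comparison only.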

Answer 1

Suppose:

text = "The quick brown fox and the quick brown bear jumped over the lazy dog"

and

dictionary = ["dog", "lazy", "quick", "sloth", "the"]

We first convert dictionary to a set:

require 'set'
dict_set = dictionary.to_set
  #=> #<Set: {"dog", "lazy", "quick", "sloth", "the"}>

and convert text to an array of downcased words:

words = text.downcase.split
  #=> ["the", "quick", "brown", "fox", "and", "the", "quick",
  #    "brown", "bear", "jumped", "over", "the", "lazy", "dog"]
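One caveat: String#split with no argument splits on whitespace only, so a real text would leave punctuation attached to words ("dog." would not match "dog"). A rough sketch of one way around that, assuming plain ASCII English text; the tokenize helper is hypothetical, not part of the original answer:

```ruby
# Hypothetical helper: pulls out runs of letters and apostrophes,
# which drops commas, periods and other punctuation.
# Assumes ASCII English text; other scripts need a different pattern.
def tokenize(text)
  text.downcase.scan(/[a-z']+/)
end

tokenize("The quick, brown fox. The lazy dog!")
  #=> ["the", "quick", "brown", "fox", "the", "lazy", "dog"]
```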

Here are a couple of ways to count the number of words in text that are in dictionary.

#1: Simple count

words.count { |w| dict_set.include?(w) }
  #=> 7

#2: Group identical words, then count

words.group_by(&:itself).reduce(0) { |tot,(k,v)|
  tot + ((dict_set.include?(k)) ? v.size : 0) }  
  #=> 7

Object#itself was introduced in v2.2. For earlier versions, replace:

group_by(&:itself)

with

group_by { |w| w }

The steps:

h = words.group_by(&:itself)
  #=> {"the"  =>["the", "the", "the"],
  #    "quick"=>["quick", "quick"],
  #    "brown"=>["brown", "brown"],
  #    "fox"=>["fox"],
  #    ...
  #    "dog"=>["dog"]} 
h.reduce(0) { |tot,(k,v)| tot + ((dict_set.include?(k)) ? v.size : 0) }
  #=> 7

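Given that Set#include? is very fast, I expect #1 to generally be fastest. That is, I doubt that the dictionary lookups saved by grouping identical words would outweigh the time spent doing the grouping.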
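As a side note, on Ruby 2.7 or later the grouping approach can be written with Enumerable#tally, which builds the word-to-count hash directly. This is a sketch of the same idea as #2, not a claim that it is faster; the method name count_matches_tally is made up for illustration:

```ruby
require 'set'

# Hypothetical helper: tally each distinct word once, then add up the
# counts of those words that appear in the dictionary set.
# Requires Ruby 2.7+ for Enumerable#tally.
def count_matches_tally(words, dict_set)
  words.tally.sum { |word, n| dict_set.include?(word) ? n : 0 }
end

text = "The quick brown fox and the quick brown bear jumped over the lazy dog"
dict_set = ["dog", "lazy", "quick", "sloth", "the"].to_set

count_matches_tally(text.downcase.split, dict_set)
  #=> 7
```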
