I want to find the most repeated words and phrases in a large set of sentences. The solution I have in mind is below.
My concern is that with a large corpus (say 100K sentences, averaging 100 words each) the hashes will grow very large and exhaust my server's memory.
Of course, I also need to strip out stop words such as prepositions: "is", "a", "to", ...
Any ideas? If it helps, my database is Postgres.
Implementation of the solution described above:
def most_common_words_or_phrases
  singles  = Hash.new(0)
  doubles  = Hash.new(0)
  triplets = Hash.new(0)

  reviews.find_each do |review|
    next if review.content.empty?

    parts = review.content.split.map!(&:downcase)
    size  = parts.size

    parts.each_with_index do |val, index|
      next if @@prepositions.include?(val)

      # Candidate second word: drop it at sentence/clause boundaries
      # (trailing ".", "," or "!") or when it is a stop word.
      second_word = parts[index + 1] if index != size - 1
      second_word = nil if second_word.present? &&
                           (val.end_with?(".", ",", "!") ||
                            @@prepositions.include?(second_word))

      # Candidate third word, stripped of non-alphanumeric characters.
      third_word = parts[index + 2] if index < size - 2
      third_word = third_word.gsub(/\p{^Alnum}/, '') if third_word.present?
      third_word = nil if second_word.blank? ||
                          second_word.end_with?(".", ",", "!") ||
                          @@prepositions.include?(third_word)

      singles[val] += 1
      if second_word.present?
        double = "#{val} #{second_word.gsub(/\p{^Alnum}/, '')}"
        # Longer phrases are weighted more heavily (2 and 3 per hit)
        # so they can compete with single words after the merge.
        doubles[double] += 2
        triplets["#{double} #{third_word}"] += 3 if third_word.present?
      end
    end
  end

  # Keys never collide across the three hashes (they differ in word
  # count), so merging simply combines them into one counter hash.
  singles.merge(doubles).merge(triplets)
end
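Once the merged hash is built, extracting the top results is just a sort. A minimal sketch, where the `counts` data below is hypothetical and stands in for the return value of `most_common_words_or_phrases`:

```ruby
# Hypothetical counts, as the merged hash might look.
counts = { "great" => 5, "great food" => 4, "great food here" => 3, "ok" => 1 }

# Sort by count, highest first, and take the top three entries.
top = counts.sort_by { |_phrase, count| -count }.first(3)
top.each { |phrase, count| puts "#{phrase}: #{count}" }
```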
Answer 0 (score: 0)
As @bronislav mentioned in the comments, it would be better to use Redis to store big hashes.
However, the hashes may not get that large. Remember that words repeat across the documents, so as you iterate over the source you only add genuinely new words to the hash. Using a hash with a default value of 0:
irb(main):001:0> words=Hash.new(0)
irb(main):003:0> words['cucumber']
=> 0
irb(main):004:0> words['cucumber'] += 1 # found new word
=> 1
irb(main):005:0> words['cucumber'] += 1 # word was already there
=> 2
Word pairs and triplets will be far more numerous, though, so you may still want to use Redis.
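A minimal sketch of moving the counting into Redis, assuming the `redis` gem and a running Redis server; the key name `phrase_counts` and the helper method are illustrative, not part of the original solution:

```ruby
# Each n-gram becomes a field in a single Redis hash. HINCRBY creates
# the field with 0 on first sight, mirroring Hash.new(0) in Ruby, while
# keeping the counts out of the Ruby process's memory.
def count_phrase_in_redis(redis, phrase)
  redis.hincrby("phrase_counts", phrase, 1)
end

# Usage (assumes `require 'redis'` and a server on localhost):
#   redis = Redis.new
#   count_phrase_in_redis(redis, "great food")
```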
In your case it would also be good to stem words before adding them to the dictionary. Stemming reduces a word to its root form, so you can easily group inflected variants: treating run, runs, and running as one word is a good thing. I strongly recommend the treat gem (not just for this, it is a great natural language processing library):
irb(main):001:0> require 'treat'
=> true
irb(main):002:0> 'run'.stem
=> "run"
irb(main):003:0> 'runs'.stem
=> "run"
irb(main):004:0> 'running'.stem
=> "run"
To filter out prepositions like "is", "a", "to" and so on, use an English stop-word list. You can find one here.
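That filtering step can be sketched as follows; the stop-word list below is a tiny illustrative subset, and in practice you would load a full English list:

```ruby
# Tiny illustrative subset of an English stop-word list.
STOP_WORDS = %w[is a to the of and in].freeze

words = "a walk to the park is fun".split
# Drop any word that appears in the stop-word list (case-insensitively).
content_words = words.reject { |w| STOP_WORDS.include?(w.downcase) }
puts content_words.join(" ")  # prints "walk park fun"
```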