I want to find the most repeated words and phrases in a large set of sentences. The solution I have in mind is below.
My concern is that with a large corpus (say 100K sentences, averaging 100 words each) the hashes will grow very large and exhaust my server's memory.
Of course, I also need to strip out stop words such as prepositions: "is", "a", "to", ...
Any ideas? If it helps, my database is Postgres.
Implementation of the solution described above:
def most_common_words_or_phrases
  singles  = Hash.new(0)
  doubles  = Hash.new(0)
  triplets = Hash.new(0)

  reviews.find_each do |review|
    next if review.content.empty?

    parts = review.content.split.map!(&:downcase)
    size  = parts.size

    parts.each_with_index do |val, index|
      next if @@prepositions.include?(val)

      # Candidate second word: drop it at sentence/clause boundaries
      # (trailing ".", "," or "!") or when it is a stop word.
      second_word = parts[index + 1] if index != size - 1
      second_word = nil if second_word.present? &&
                           (val.end_with?(".", ",", "!") ||
                            @@prepositions.include?(second_word))

      # Candidate third word, stripped of non-alphanumeric characters.
      third_word = parts[index + 2] if index < size - 2
      third_word = third_word.gsub(/\p{^Alnum}/, '') if third_word.present?
      third_word = nil if second_word.blank? ||
                          second_word.end_with?(".", ",", "!") ||
                          @@prepositions.include?(third_word)

      singles[val] += 1
      if second_word.present?
        double = "#{val} #{second_word.gsub(/\p{^Alnum}/, '')}"
        # Longer phrases are weighted more heavily (2 and 3 per hit)
        # so they can compete with single words after the merge.
        doubles[double] += 2
        triplets["#{double} #{third_word}"] += 3 if third_word.present?
      end
    end
  end

  # Keys never collide across the three hashes (they differ in word
  # count), so merging simply combines them into one counter hash.
  singles.merge(doubles).merge(triplets)
end
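Once the merged hash is built, extracting the top results is just a sort. A minimal sketch, where the `counts` data below is hypothetical and stands in for the return value of `most_common_words_or_phrases`:

```ruby
# Hypothetical counts, as the merged hash might look.
counts = { "great" => 5, "great food" => 4, "great food here" => 3, "ok" => 1 }

# Sort by count, highest first, and take the top three entries.
top = counts.sort_by { |_phrase, count| -count }.first(3)
top.each { |phrase, count| puts "#{phrase}: #{count}" }
```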
Answer 0 (score: 0)
As @bronislav mentioned in the comments, it would be better to use Redis to store big hashes.
However, the hashes may not get that large. Remember that words repeat across the documents, so as you iterate over the source you only add genuinely new words to the hash. Using a hash with a default value of 0:
irb(main):001:0> words=Hash.new(0)
irb(main):003:0> words['cucumber']
=> 0
irb(main):004:0> words['cucumber'] += 1 # found new word
=> 1
irb(main):005:0> words['cucumber'] += 1 # word was already there
=> 2
Word pairs and triplets will be far more numerous, though, so you may still want to use Redis.
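A minimal sketch of moving the counting into Redis, assuming the `redis` gem and a running Redis server; the key name `phrase_counts` and the helper method are illustrative, not part of the original solution:

```ruby
# Each n-gram becomes a field in a single Redis hash. HINCRBY creates
# the field with 0 on first sight, mirroring Hash.new(0) in Ruby, while
# keeping the counts out of the Ruby process's memory.
def count_phrase_in_redis(redis, phrase)
  redis.hincrby("phrase_counts", phrase, 1)
end

# Usage (assumes `require 'redis'` and a server on localhost):
#   redis = Redis.new
#   count_phrase_in_redis(redis, "great food")
```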
In your case it would also be good to stem words before adding them to the dictionary. Stemming reduces a word to its root form, so you can easily group inflected variants: treating run, runs, and running as one word is a good thing. I strongly recommend the treat gem (not just for this, it is a great natural language processing library):
irb(main):001:0> require 'treat'
=> true
irb(main):002:0> 'run'.stem
=> "run"
irb(main):003:0> 'runs'.stem
=> "run"
irb(main):004:0> 'running'.stem
=> "run"
To filter out prepositions like "is", "a", "to" and so on, use an English stop-word list. You can find one here.
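That filtering step can be sketched as follows; the stop-word list below is a tiny illustrative subset, and in practice you would load a full English list:

```ruby
# Tiny illustrative subset of an English stop-word list.
STOP_WORDS = %w[is a to the of and in].freeze

words = "a walk to the park is fun".split
# Drop any word that appears in the stop-word list (case-insensitively).
content_words = words.reject { |w| STOP_WORDS.include?(w.downcase) }
puts content_words.join(" ")  # prints "walk park fun"
```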