Question

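I am parsing multiple websites and trying to build a hash that looks like this:

"word" => [["01.html", 2], ["02.html", 7], ["03.html", 4]]

where "word" is a given word in the index, the first value in each sublist is the file it was found in, and the second value is the number of occurrences in that file.

I've run into a problem: instead of appending ["02.html", 7] to the value list, it creates a brand-new entry for "word" and puts ["02.html", 7] at the end of the hash. As a result, I basically get a separate index for each of my websites instead of one master index.

Here is my code: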

for token in tokens
   if !invindex.include?(token)
     invindex[token] = [[doc_name, 1]] #adds the word to the hash with the doc name and  occurrence of 1
   else
     for list in invindex[token]
       if list[0] == doc_name
         list[1] += 1 #adds one to the occurrence with the same doc_name
       else
         invindex[token].insert([doc_name, 1]) #this SHOULD append the doc name and initial occurrence inside the word's value list since the word is already in the hash
       end
     end
   end
 end
end

Hopefully this is fairly simple and I'm just missing something when I trace it out on paper.

Answer 1

Do you really need a hash containing arrays of arrays?

This could be better described with a nested hash:

invindex = {
  "word" => { '01.html' => 2, '02.html' => 7, '03.html' => 4 },
  "other" => { '01.html' => 1, '02.html' => 17, '04.html' => 4 }
}

which can easily be populated using a hash factory like this:

invindex = Hash.new { |h,k| h[k] = Hash.new {|hh,kk| hh[kk] = 0} }
tokens.each do |token|
  invindex[token][doc_name] += 1
end

Now, if you absolutely need the format you mentioned, you can get it from the invindex described above with a simple iteration:

result = {}
invindex.each {|k,v| result[k] = v.to_a }
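
As a rough end-to-end sketch of the nested-hash approach across several documents: the `docs` hash and its contents below are made up for illustration, and it assumes the tokens for each page have already been extracted. `Hash.new(0)` is used for the inner hash instead of the block form, which behaves the same for `+=` on integer counts, and `transform_values` needs Ruby 2.4 or later.

```ruby
# Hypothetical input: file name => tokens found in that file.
docs = {
  '01.html' => %w[word other word],
  '02.html' => %w[word]
}

# The outer default creates an inner counting hash on first access;
# Hash.new(0) makes a missing count read as 0.
invindex = Hash.new { |h, k| h[k] = Hash.new(0) }

docs.each do |doc_name, tokens|
  tokens.each { |token| invindex[token][doc_name] += 1 }
end

# Convert each inner hash to the array-of-pairs format from the question.
result = invindex.transform_values(&:to_a)
p result
```

Because the same invindex is reused for every document, each word ends up with one entry per file rather than one index per site.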

Answer 2

Suppose:

arr = %w| 01.html 02.html 03.html 02.html 03.html 03.html |
  #=> ["01.html", "02.html", "03.html", "02.html", "03.html", "03.html"]

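is the array of files for a given word in the index. The value for that word in the hash is then obtained by constructing a counting hash: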

h = arr.each_with_object(Hash.new(0)) { |s,h| h[s] += 1 }
  #=> {"01.html"=>1, "02.html"=>2, "03.html"=>3}

and then converting it to an array:

h.to_a
  #=> [["01.html", 1], ["02.html", 2], ["03.html", 3]]

So you can write:

arr.each_with_object(Hash.new(0)) { |s,h| h[s] += 1 }.to_a

Hash::new is given a default value of zero. This means that if the hash h being constructed does not have a key s, h[s] returns zero. In that case:

h[s] += 1
  #=> h[s] = h[s] + 1
  #        = 0 + 1 = 1

When the same value s from arr is passed to the block again:

h[s] += 1
  #=> h[s] = h[s] + 1
  #        = 1 + 1 = 2

You might consider whether it would be better to make the value for each word in the index the hash h itself.
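
The counting hash above can also be built with Enumerable#tally. This is only a minimal sketch and assumes Ruby 2.7 or later, where that method is available; the names `counts` and `pairs` are illustrative.

```ruby
arr = %w[01.html 02.html 03.html 02.html 03.html 03.html]

# tally counts how many times each element occurs and returns a hash.
counts = arr.tally

# Same array-of-pairs format as h.to_a above.
pairs = counts.to_a
p pairs
```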

Answer 3

I've run into a problem: instead of appending ["02.html", 7] to the value list, it creates a brand-new entry for "word" and puts ["02.html", 7] at the end of the hash.

I don't see that:

invindex = {
  word1: [ 
    ['01.html', 2],
  ]
}

tokens = %i[
  word1
  word2
  word3
]

doc_name = '02.html'

tokens.each do |token|
  if !invindex.include?(token)
    invindex[token] = [[doc_name, 1]] #adds the word to the hash with the doc name and  occurrence of 1
  else
    invindex[token].each do |list|
      if list[0] == doc_name
        list[1] += 1 #adds one to the occurrence with the same doc_name
      else
        invindex[token].insert([doc_name, 1]) #this SHOULD append the doc name and initial occurrence inside the word's value list since the word is already in the hash
      end
    end
  end

end

p invindex

--output:--
{:word1=>[["01.html", 2]], :word2=>[["02.html", 1]], :word3=>[["02.html", 1]]}

invindex[token].insert([doc_name, 1]) #this SHOULD append the doc name

It doesn't. Compare:

invindex = {
  word: [ 
    ['01.html', 2],
  ]
}

token = :word
doc_name = '02.html'

invindex[token].insert([doc_name, 7])
p invindex
invindex[token].insert(-1, ["02.html", 7])
p invindex

--output:--
{:word=>[["01.html", 2]]}
{:word=>[["01.html", 2], ["02.html", 7]]}

Array#insert() requires you to specify an index as the first argument. Generally, when you want to append something to the end, use <<:

invindex = {
  word: [ 
    ['01.html', 2],
  ]
}

token = :word
doc_name = '02.html'

invindex[token] << [doc_name, 7]
p invindex

--output:--
{:word=>[["01.html", 2], ["02.html", 7]]}
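
For reference, a small sketch of three equivalent ways to append to the end of an array (the `list` variable and its contents are only illustrative):

```ruby
list = [['01.html', 2]]

list << ['02.html', 7]            # shovel operator
list.push(['03.html', 4])         # same effect as <<
list.insert(-1, ['04.html', 1])   # index -1 inserts after the last element

p list
```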

for token in tokens

Rubyists don't use for-in loops, because a for-in loop calls each(), so Rubyists call each() directly:

tokens.each do |token|
  ...
end
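
One practical difference between the two forms, sketched below with illustrative variable names: a for loop does not introduce a new scope, so its loop variable is still visible afterwards, whereas a block parameter of each() is local to the block.

```ruby
tokens = %w[a b c]

for token in tokens
end
leaked = token # the for loop variable is still defined here

tokens.each do |word|
end
# word is local to the block, so defined?(word) is nil out here.
each_leaks = defined?(word) ? true : false

p leaked
p each_leaks
```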

Finally, indenting in Ruby is 2 spaces: not 3 spaces, not 1 space, not 4 spaces. It's 2 spaces.

Applying all of that to your code:

invindex = {
  word1: [ 
    ['01.html', 2],
  ]
}

tokens = %i[
  word1
  word2
  word3
]

doc_name = '01.html'

tokens.each do |token|
  if !invindex.include?(token)
    invindex[token] = [[doc_name, 1]] #adds the word to the hash with the doc name and  occurrence of 1
  else
    invindex[token].each do |list|
      if list[0] == doc_name
        list[1] += 1 #adds one to the occurrence with the same doc_name
      else
        invindex[token] << [doc_name, 1] #this SHOULD append the doc name and initial occurrence inside the word's value list since the word is already in the hash
      end
    end
  end

end

p invindex

--output:--
{:word1=>[["01.html", 3]], :word2=>[["01.html", 1]], :word3=>[["01.html", 1]]}

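However, there is still a problem, and it arises because you are changing the array that you are stepping through, which is a no-no in computer programming: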

invindex[token].each do |list|
  if list[0] == doc_name
    list[1] += 1 #adds one to the occurrence with the same doc_name
  else
    invindex[token] << [doc_name, 1]  #***PROBLEM***

Watch what happens:

invindex = {
  word1: [ 
    ['01.html', 2],
  ]
}

tokens = %i[
  word1
  word2
  word3
]

%w[ 01.html 02.html].each do |doc_name|

  tokens.each do |token|
    if !invindex.include?(token)
      invindex[token] = [[doc_name, 1]] #adds the word to the hash with the doc name and  occurrence of 1
    else
      invindex[token].each do |list|
        if list[0] == doc_name
          list[1] += 1 #adds one to the occurrence with the same doc_name
        else
          invindex[token] << [doc_name, 1] #this SHOULD append the doc name and initial occurrence inside the word's value list since the word is already in the hash
        end
      end
    end

  end
end

p invindex

--output:--
{:word1=>[["01.html", 3], ["02.html", 2]], :word2=>[["01.html", 1], ["02.html", 2]], :word3=>[["01.html", 1], ["02.html", 2]]}

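Problem 1: You don't want to insert [doc_name, 1] every time the subarray you are examining doesn't contain doc_name. You only want to insert [doc_name, 1] after you have examined all of the subarrays and doc_name wasn't found. If you run the example above with this starting hash: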

invindex = {
  word1: [ 
    ['01.html', 2],
    ['02.html', 7],
  ]
}

...you'll see that the output is even worse.
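
To make that concrete, here is the same loop run against the two-entry starting hash. It is only a sketch that reuses the example's names and data, printing just the entry for :word1.

```ruby
invindex = { word1: [['01.html', 2], ['02.html', 7]] }
tokens = %i[word1 word2 word3]

%w[01.html 02.html].each do |doc_name|
  tokens.each do |token|
    if !invindex.include?(token)
      invindex[token] = [[doc_name, 1]]
    else
      invindex[token].each do |list|
        if list[0] == doc_name
          list[1] += 1
        else
          # Runs once for every subarray that doesn't match, so duplicates pile up.
          invindex[token] << [doc_name, 1]
        end
      end
    end
  end
end

p invindex[:word1]
```

Here :word1 ends up with five subarrays instead of two, including a second "01.html" entry and two extra "02.html" entries.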

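Problem 2: Appending [doc_name, 1] to the array while you are stepping through it means that [doc_name, 1] will itself be examined when the loop reaches the end of the array, and then your loop increments its count to 2. The rule is: don't change an array that you are stepping through, because bad things will happen.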
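
One possible way to avoid both problems, offered as a sketch rather than the only fix: search the word's list first with find(), and append only after the search has finished without a match. The helper name `add_tokens` is hypothetical and not part of the original code.

```ruby
# Hypothetical helper: records one occurrence of each token for doc_name.
def add_tokens(invindex, tokens, doc_name)
  tokens.each do |token|
    entries = (invindex[token] ||= [])

    # Look through all subarrays before deciding whether to append.
    entry = entries.find { |name, _count| name == doc_name }

    if entry
      entry[1] += 1            # doc already listed: bump its count
    else
      entries << [doc_name, 1] # not found anywhere: append once, outside any loop over entries
    end
  end
  invindex
end

invindex = { word1: [['01.html', 2], ['02.html', 7]] }
tokens = %i[word1 word2 word3]

%w[01.html 02.html].each { |doc_name| add_tokens(invindex, tokens, doc_name) }
p invindex
```

Because the append happens after find() returns, the array is never modified while it is being iterated, and each document gets at most one subarray per word.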

Appending to a hash
