Question

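I'm processing a batch of data and haven't yet coded a duplicate checker into my data processor, so I expected duplicates to occur. I ran the following SQL query: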

SELECT     body, COUNT(body) AS dup_count 
FROM         comments
GROUP BY body
HAVING     (COUNT(body) > 1)

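to get a list of duplicates. Looking at the results, I found that these duplicates have multiple hashes. The shortest comment string is "[deleted]", so let's use that as an example. In my database there are nine instances of the comment "[deleted]", and they produced the hashes 1169143752200809218 and 1738115474508091027. The hash beginning with 116 was found 6 times and the one beginning with 173 was found 3 times. However, when I run it in IRB, I get the following: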

a = '[deleted]'.hash # => 811866697208321010

Here is the code I'm using to generate the hash:

def comment_and_hash(chunk)     
  comment = chunk.at_xpath('*/span[@class="comment"]').text ##Get Comment##
  hash = comment.hash
  return comment,hash
end

I have confirmed that nothing else in my code touches the comment. Here is my DataMapper class.

class Comment

    include DataMapper::Resource

    property :uid       , Serial
    property :author    , String
    property :date      , Date
    property :body      , Text
    property :arank     , Float 
    property :srank     , Float 
    property :parent    , Integer #Should Be UID of another comment or blank if parent
    property :value     , Integer #Hash to prevent duplicates from occurring

end

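Am I correct in assuming that .hash on a String returns the same value every time it is called on the same string?

Assuming my string consists of "[deleted]", which value is the correct one?

Is there a way to have different strings in Ruby that SQL would treat as the same string? That seems the most plausible explanation for what is happening, but I'm really shooting in the dark.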

Answer 1

If you run

ruby -e "puts '[deleted]'.hash"

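a few times, you will see that the value differs. In fact, the hash value only stays constant for as long as the Ruby process is alive. The reason is that String#hash is seeded with a random value. rb_str_hash (the C implementation function) uses rb_hash_start, which uses this random seed, and the seed is initialized each time Ruby is spawned.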
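The behaviour described above can be observed from a single script. The following is a minimal sketch, assuming a Ruby interpreter is available at the path reported by RbConfig.ruby; the helper name hash_in_fresh_process is made up for this illustration.

```ruby
require 'rbconfig'

# Hypothetical helper: ask a brand-new Ruby interpreter for the hash of +str+.
# Each call spawns a separate process, so each one gets its own random seed.
def hash_in_fresh_process(str)
  IO.popen([RbConfig.ruby, '-e', 'print ARGV[0].hash', str], &:read).to_i
end

text = '[deleted]'

# Within one process the value is stable, even for a distinct String object.
same_process = [text.hash, text.dup.hash]

# Across processes the values are expected to differ.
fresh = Array.new(2) { hash_in_fresh_process(text) }

puts "same process:    #{same_process.inspect}"
puts "fresh processes: #{fresh.inspect}"
```

This would explain the two different values stored for "[deleted]": they were presumably written by two separate runs of the data processor, and neither of them is the "correct" one.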

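You could use a CRC such as Zlib.crc32 for your purposes, or you may want to use one of the message digests from OpenSSL::Digest, although the latter is overkill, since for detecting duplicates you probably won't need the security properties.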
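A minimal sketch of both options follows, using only the standard library. The method names crc_fingerprint and sha256_fingerprint are made up for this illustration, and SHA-256 is just one of the digests OpenSSL::Digest offers.

```ruby
require 'zlib'
require 'openssl'

# 32-bit checksum: cheap to compute and fits in an integer column,
# but distinct strings can collide.
def crc_fingerprint(text)
  Zlib.crc32(text)
end

# Cryptographic digest: returned here as a 64-character hex string,
# so it would need a String column rather than an Integer one.
def sha256_fingerprint(text)
  OpenSSL::Digest::SHA256.hexdigest(text)
end

puts crc_fingerprint('[deleted]')
puts sha256_fingerprint('[deleted]')
```

Unlike String#hash, both values depend only on the bytes of the input, so they stay the same across processes and machines as long as the string's encoding is the same.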

Answer 2

I use the following to create an alternative to String#hash that is consistent across time and processes:

require 'zlib'

def generate_id(label)
  Zlib.crc32(label.to_s) % (2 ** 30 - 1)
end
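As a rough illustration of how such a stable ID might be used for the duplicate check the question asks about, here is an in-memory sketch. The drop_duplicates method is hypothetical and not part of the answer above. It compares the bodies as well as the IDs, on the assumption that a 30-bit ID will occasionally collide for different comments.

```ruby
require 'zlib'
require 'set'

# Same stable ID as above, repeated so this sketch is self-contained.
def generate_id(label)
  Zlib.crc32(label.to_s) % (2 ** 30 - 1)
end

# Hypothetical helper: keep only the first occurrence of each comment body.
# Bodies are grouped by ID and then compared directly, so two different
# bodies that happen to share an ID are both kept.
def drop_duplicates(bodies)
  seen = Hash.new { |hash, id| hash[id] = Set.new }
  bodies.select { |body| seen[generate_id(body)].add?(body) }
end

p drop_duplicates(['[deleted]', 'first post', '[deleted]'])
```

In a database-backed version, the ID would typically live in an indexed column (such as value in the Comment class) and the body comparison would only run on rows with a matching ID.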

Answer 3

Ruby intentionally makes String#hash produce different values in different sessions: Why is Ruby String.hash inconsistent across machines?

Am I misunderstanding String#hash in Ruby?
