SHA hashing for training/validation/test set splits

Asked: 2017-01-31 10:16:13

Tags: python machine-learning tensorflow sha

Below is a small snippet from the full code.

I am trying to understand the logic behind this splitting method:

  • A SHA-1 digest is 40 hexadecimal characters. What kind of probability is the expression computing?
  • What is the reason for (MAX_NUM_IMAGES_PER_CLASS + 1)? Why add 1?
  • Does choosing a different value for MAX_NUM_IMAGES_PER_CLASS affect the quality of the split?
  • How good a split can we expect from this? Is this a recommended way to split a dataset?

    # We want to ignore anything after '_nohash_' in the file name when
    # deciding which set to put an image in, the data set creator has a way of
    # grouping photos that are close variations of each other. For example
    # this is used in the plant disease data set to group multiple pictures of
    # the same leaf.
    hash_name = re.sub(r'_nohash_.*$', '', file_name)
    # This looks a bit magical, but we need to decide whether this file should
    # go into the training, testing, or validation sets, and we want to keep
    # existing files in the same set even if more files are subsequently
    # added.
    # To do that, we need a stable way of deciding based on just the file name
    # itself, so we do a hash of that and then use that to generate a
    # probability value that we use to assign it.
    hash_name_hashed = hashlib.sha1(compat.as_bytes(hash_name)).hexdigest()
    percentage_hash = ((int(hash_name_hashed, 16) %
                        (MAX_NUM_IMAGES_PER_CLASS + 1)) *
                       (100.0 / MAX_NUM_IMAGES_PER_CLASS))
    if percentage_hash < validation_percentage:
      validation_images.append(base_name)
    elif percentage_hash < (testing_percentage + validation_percentage):
      testing_images.append(base_name)
    else:
      training_images.append(base_name)

    result[label_name] = {
        'dir': dir_name,
        'training': training_images,
        'testing': testing_images,
        'validation': validation_images,
    }
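For reference, here is a minimal, self-contained sketch of the same logic, stripped of the TensorFlow dependency (`which_set` is a hypothetical wrapper name; `MAX_NUM_IMAGES_PER_CLASS = 2 ** 27 - 1` is the value used in the retrain script this snippet comes from). It makes the two key properties easy to check: the assignment is deterministic, and `_nohash_` variants of the same photo always land in the same set.

```python
import hashlib
import re

MAX_NUM_IMAGES_PER_CLASS = 2 ** 27 - 1  # value from the TensorFlow retrain script


def which_set(file_name, validation_percentage=10, testing_percentage=10):
    """Deterministically assign a file name to a split, mirroring the snippet."""
    # Strip the '_nohash_' suffix so close variations of a photo group together.
    hash_name = re.sub(r'_nohash_.*$', '', file_name)
    hash_name_hashed = hashlib.sha1(hash_name.encode('utf-8')).hexdigest()
    percentage_hash = ((int(hash_name_hashed, 16) %
                        (MAX_NUM_IMAGES_PER_CLASS + 1)) *
                       (100.0 / MAX_NUM_IMAGES_PER_CLASS))
    if percentage_hash < validation_percentage:
        return 'validation'
    elif percentage_hash < (testing_percentage + validation_percentage):
        return 'testing'
    return 'training'


# Same name always lands in the same set, even across program runs,
# and '_nohash_' variants of one photo always land together.
assert which_set('leaf_1.jpg') == which_set('leaf_1.jpg')
assert which_set('leaf_1_nohash_a.jpg') == which_set('leaf_1_nohash_b.jpg')
```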

1 Answer:

Answer 0 (score: 1)

This code simply distributes the file names "randomly" (but repeatably) over a number of bins, and then groups the bins into three categories. The number of bits in the hash is immaterial (so long as it is "enough", which is about 35 for this sort of job).
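You can observe the "random but repeatable" behaviour empirically. The sketch below (synthetic file names, standard library only) hashes 100,000 names with the snippet's formula and buckets them into a 10/10/80 split; each bucket lands very close to its target fraction because SHA-1 output is effectively uniform.

```python
import hashlib
from collections import Counter

MAX_NUM_IMAGES_PER_CLASS = 2 ** 27 - 1


def percentage_hash(name):
    # Same formula as the snippet: hash -> bin index -> value in [0, 100].
    return ((int(hashlib.sha1(name.encode('utf-8')).hexdigest(), 16) %
             (MAX_NUM_IMAGES_PER_CLASS + 1)) *
            (100.0 / MAX_NUM_IMAGES_PER_CLASS))


counts = Counter()
for i in range(100_000):
    p = percentage_hash(f'image_{i}.jpg')
    if p < 10:
        counts['validation'] += 1
    elif p < 20:
        counts['testing'] += 1
    else:
        counts['training'] += 1

# Each bucket is within a few hundred of its expected count (std. dev. ~95).
assert abs(counts['validation'] - 10_000) < 500
assert abs(counts['testing'] - 10_000) < 500
assert abs(counts['training'] - 80_000) < 1_000
```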

Reducing modulo *n* + 1 produces a value in [0, *n*], and multiplying that by 100/*n* obviously produces a value in [0, 100], which is then interpreted as a percentage. Choosing *n* to be MAX_NUM_IMAGES_PER_CLASS keeps the rounding error in that interpretation below "one image".
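With a small *n* the arithmetic is easy to inspect by hand (this toy example just substitutes n = 4 into the snippet's formula):

```python
# Reducing mod (n + 1) yields a value in [0, n]; scaling by 100/n maps that
# onto [0, 100] in steps of 100/n. With n = 4 the step is 25, so the only
# reachable "percentages" are 0, 25, 50, 75 and 100.
n = 4
values = [(h % (n + 1)) * (100.0 / n) for h in range(12)]
assert sorted(set(values)) == [0.0, 25.0, 50.0, 75.0, 100.0]
```

With n = MAX_NUM_IMAGES_PER_CLASS the step shrinks to 100/n percent, i.e. roughly "one image per class" of granularity, which is the point the answer makes.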

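A sketch of that alternative, under the answer's description (all names here are hypothetical, not from the TensorFlow code): precompute two integer boundaries over the full 2^160 SHA-1 space once, then classify each hash with two integer comparisons and no floating point at all.

```python
import hashlib

SPACE = 2 ** 160          # total number of possible SHA-1 digests
VALIDATION_PCT = 10
TESTING_PCT = 10

# Exact integer boundaries; no per-file rounding is needed afterwards.
validation_bound = SPACE * VALIDATION_PCT // 100
testing_bound = SPACE * (VALIDATION_PCT + TESTING_PCT) // 100


def which_set(file_name):
    """Assign a file to a split by comparing its hash against the boundaries."""
    h = int(hashlib.sha1(file_name.encode('utf-8')).hexdigest(), 16)
    if h < validation_bound:
        return 'validation'
    elif h < testing_bound:
        return 'testing'
    return 'training'
```

The assignment is still stable across runs (it depends only on the file name), but the split fractions are now exact up to the integer division of 2^160 by 100.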
此策略是合理的,但看起来比实际要复杂一些(因为仍在进行舍入运算,其余的则引入了偏差-尽管数字如此之大,这是完全不可观察的)。您可以通过简单地为每个类预先计算2 ^ 160个散列的整个空间上的范围,并仅针对两个边界检查散列,来使其更简单,更准确。从概念上讲,这仍然涉及到舍入,但是只有160位才能表示浮点数,例如31%的小数。