SHA hashing for training/validation/test set splits

Asked: 2017-01-31 10:16:13

Tags: python machine-learning tensorflow sha

Below is a small snippet from the full code.

I am trying to understand the logic behind this splitting method:

  • A SHA-1 digest is 40 hexadecimal characters. What kind of probability is the expression computing?
  • What is the reason for (MAX_NUM_IMAGES_PER_CLASS + 1)? Why add 1?
  • Does choosing a different value for MAX_NUM_IMAGES_PER_CLASS affect the quality of the split?
  • How good a split can we expect from this? Is this a recommended way to split a dataset?

    # We want to ignore anything after '_nohash_' in the file name when
    # deciding which set to put an image in, the data set creator has a way of
    # grouping photos that are close variations of each other. For example
    # this is used in the plant disease data set to group multiple pictures of
    # the same leaf.
    hash_name = re.sub(r'_nohash_.*$', '', file_name)
    # This looks a bit magical, but we need to decide whether this file should
    # go into the training, testing, or validation sets, and we want to keep
    # existing files in the same set even if more files are subsequently
    # added.
    # To do that, we need a stable way of deciding based on just the file name
    # itself, so we do a hash of that and then use that to generate a
    # probability value that we use to assign it.
    hash_name_hashed = hashlib.sha1(compat.as_bytes(hash_name)).hexdigest()
    percentage_hash = ((int(hash_name_hashed, 16) %
                        (MAX_NUM_IMAGES_PER_CLASS + 1)) *
                       (100.0 / MAX_NUM_IMAGES_PER_CLASS))
    if percentage_hash < validation_percentage:
      validation_images.append(base_name)
    elif percentage_hash < (testing_percentage + validation_percentage):
      testing_images.append(base_name)
    else:
      training_images.append(base_name)

    result[label_name] = {
        'dir': dir_name,
        'training': training_images,
        'testing': testing_images,
        'validation': validation_images,
    }
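For reference, here is a minimal, self-contained sketch of the same logic, stripped of the TensorFlow dependency (`which_set` is a hypothetical wrapper name; `MAX_NUM_IMAGES_PER_CLASS = 2 ** 27 - 1` is the value used in the retrain script this snippet comes from). It makes the two key properties easy to check: the assignment is deterministic, and `_nohash_` variants of the same photo always land in the same set.

```python
import hashlib
import re

MAX_NUM_IMAGES_PER_CLASS = 2 ** 27 - 1  # value from the TensorFlow retrain script


def which_set(file_name, validation_percentage=10, testing_percentage=10):
    """Deterministically assign a file name to a split, mirroring the snippet."""
    # Strip the '_nohash_' suffix so close variations of a photo group together.
    hash_name = re.sub(r'_nohash_.*$', '', file_name)
    hash_name_hashed = hashlib.sha1(hash_name.encode('utf-8')).hexdigest()
    percentage_hash = ((int(hash_name_hashed, 16) %
                        (MAX_NUM_IMAGES_PER_CLASS + 1)) *
                       (100.0 / MAX_NUM_IMAGES_PER_CLASS))
    if percentage_hash < validation_percentage:
        return 'validation'
    elif percentage_hash < (testing_percentage + validation_percentage):
        return 'testing'
    return 'training'


# Same name always lands in the same set, even across program runs,
# and '_nohash_' variants of one photo always land together.
assert which_set('leaf_1.jpg') == which_set('leaf_1.jpg')
assert which_set('leaf_1_nohash_a.jpg') == which_set('leaf_1_nohash_b.jpg')
```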

1 Answer:

Answer 0 (score: 1)

This code simply distributes the file names "randomly" (but repeatably) over a number of bins, and then groups the bins into three categories. The number of bits in the hash is immaterial (so long as it is "enough", which is about 35 for this sort of job).
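You can observe the "random but repeatable" behaviour empirically. The sketch below (synthetic file names, standard library only) hashes 100,000 names with the snippet's formula and buckets them into a 10/10/80 split; each bucket lands very close to its target fraction because SHA-1 output is effectively uniform.

```python
import hashlib
from collections import Counter

MAX_NUM_IMAGES_PER_CLASS = 2 ** 27 - 1


def percentage_hash(name):
    # Same formula as the snippet: hash -> bin index -> value in [0, 100].
    return ((int(hashlib.sha1(name.encode('utf-8')).hexdigest(), 16) %
             (MAX_NUM_IMAGES_PER_CLASS + 1)) *
            (100.0 / MAX_NUM_IMAGES_PER_CLASS))


counts = Counter()
for i in range(100_000):
    p = percentage_hash(f'image_{i}.jpg')
    if p < 10:
        counts['validation'] += 1
    elif p < 20:
        counts['testing'] += 1
    else:
        counts['training'] += 1

# Each bucket is within a few hundred of its expected count (std. dev. ~95).
assert abs(counts['validation'] - 10_000) < 500
assert abs(counts['testing'] - 10_000) < 500
assert abs(counts['training'] - 80_000) < 1_000
```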

Reducing modulo *n* + 1 produces a value in [0, *n*], and multiplying that by 100/*n* obviously produces a value in [0, 100], which is then interpreted as a percentage. Choosing *n* to be MAX_NUM_IMAGES_PER_CLASS keeps the rounding error in that interpretation below "one image".
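With a small *n* the arithmetic is easy to inspect by hand (this toy example just substitutes n = 4 into the snippet's formula):

```python
# Reducing mod (n + 1) yields a value in [0, n]; scaling by 100/n maps that
# onto [0, 100] in steps of 100/n. With n = 4 the step is 25, so the only
# reachable "percentages" are 0, 25, 50, 75 and 100.
n = 4
values = [(h % (n + 1)) * (100.0 / n) for h in range(12)]
assert sorted(set(values)) == [0.0, 25.0, 50.0, 75.0, 100.0]
```

With n = MAX_NUM_IMAGES_PER_CLASS the step shrinks to 100/n percent, i.e. roughly "one image per class" of granularity, which is the point the answer makes.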

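A sketch of that alternative, under the answer's description (all names here are hypothetical, not from the TensorFlow code): precompute two integer boundaries over the full 2^160 SHA-1 space once, then classify each hash with two integer comparisons and no floating point at all.

```python
import hashlib

SPACE = 2 ** 160          # total number of possible SHA-1 digests
VALIDATION_PCT = 10
TESTING_PCT = 10

# Exact integer boundaries; no per-file rounding is needed afterwards.
validation_bound = SPACE * VALIDATION_PCT // 100
testing_bound = SPACE * (VALIDATION_PCT + TESTING_PCT) // 100


def which_set(file_name):
    """Assign a file to a split by comparing its hash against the boundaries."""
    h = int(hashlib.sha1(file_name.encode('utf-8')).hexdigest(), 16)
    if h < validation_bound:
        return 'validation'
    elif h < testing_bound:
        return 'testing'
    return 'training'
```

The assignment is still stable across runs (it depends only on the file name), but the split fractions are now exact up to the integer division of 2^160 by 100.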
此策略是合理的,但看起来比实际要复杂一些(因为仍在进行舍入运算,其余的则引入了偏差-尽管数字如此之大,这是完全不可观察的)。您可以通过简单地为每个类预先计算2 ^ 160个散列的整个空间上的范围,并仅针对两个边界检查散列,来使其更简单,更准确。从概念上讲,这仍然涉及到舍入,但是只有160位才能表示浮点数,例如31%的小数。