Question

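I am using a CNN for an image classification problem. My image dataset contains duplicate images, and when I train the CNN on this data it overfits. I therefore need to remove the duplicates.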

Answer 1

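It is hard for an algorithm to decide what counts as a duplicate. Your duplicates can be:

1. Exact duplicates
2. Near-exact duplicates (minor edits of the image, etc.)
3. Perceptual duplicates (same content, but a different view, camera, etc.)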

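No. 1 and No. 2 are easier to solve. No. 3 is very subjective and still a research topic. I can offer a solution for No. 1 and No. 2. Both solutions use the excellent imagehash library: https://github.com/JohannesBuchner/imagehash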
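
To give a feel for why an image hash helps here, the following is a minimal sketch of the idea behind an average hash, written in plain Python. It is an illustration, not the library's implementation: the function names and the 4x4 pixel grids are hypothetical, and the real library first shrinks the image and converts it to grayscale before comparing pixels to the mean.

```python
def average_hash_bits(pixels):
    """Return a tuple of 0/1 bits: 1 where a pixel is brighter than the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if value > mean else 0 for value in flat)


def hamming_distance(bits_a, bits_b):
    """Count the positions at which two equal-length bit tuples differ."""
    if len(bits_a) != len(bits_b):
        raise ValueError("hashes must have the same length")
    return sum(a != b for a, b in zip(bits_a, bits_b))


# Hypothetical 4x4 grayscale thumbnails (0 = black, 255 = white).
original = [
    [10, 20, 200, 210],
    [15, 25, 205, 215],
    [12, 22, 202, 212],
    [18, 28, 208, 218],
]
# The same picture, uniformly brightened: every pixel shifted by +20.
brightened = [[value + 20 for value in row] for row in original]
# A different picture: the bright half is now on the left.
mirrored = [row[::-1] for row in original]

# The brightened copy keeps the same bright/dark pattern, so distance is 0.
print(hamming_distance(average_hash_bits(original), average_hash_bits(brightened)))
# The mirrored picture flips every bit, so distance is 16.
print(hamming_distance(average_hash_bits(original), average_hash_bits(mirrored)))
```

A small edit to an image changes few or no bits, while a different image changes many, which is what makes a distance threshold usable for case No. 2.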

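Exact duplicates

Exact duplicates can be found with perceptual hashing. The phash library is very good at this, and I routinely use it to clean training data. Usage (from the GitHub site) is very simple: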

from PIL import Image
import imagehash

# image_fns : List of training image files
img_hashes = {}  # maps each hash to the first file seen with that hash

for image_fn in sorted(image_fns):
    # Use img_hash rather than hash to avoid shadowing the built-in hash().
    img_hash = imagehash.average_hash(Image.open(image_fn))
    if img_hash in img_hashes:
        print('{} duplicate of {}'.format(image_fn, img_hashes[img_hash]))
    else:
        img_hashes[img_hash] = image_fn
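
The loop above only reports duplicates. To actually drop them from the training list, something like the following sketch could be used. The names `split_duplicates` and `file_digest` are hypothetical, not part of any library. Note that the default key here is a swapped-in technique, a SHA-256 digest of the raw file bytes, which only catches byte-identical files; a perceptual key can be passed in instead.

```python
import hashlib


def file_digest(path):
    """Hypothetical default key: SHA-256 of the raw file bytes."""
    with open(path, "rb") as handle:
        return hashlib.sha256(handle.read()).hexdigest()


def split_duplicates(paths, key_fn=file_digest):
    """Return (unique_paths, duplicates).

    duplicates is a list of (duplicate_path, first_seen_path) pairs.
    Paths are visited in sorted order, so the result is deterministic.
    """
    first_seen = {}      # key -> first path that produced it
    unique_paths = []
    duplicates = []
    for path in sorted(paths):
        key = key_fn(path)
        if key in first_seen:
            duplicates.append((path, first_seen[key]))
        else:
            first_seen[key] = path
            unique_paths.append(path)
    return unique_paths, duplicates
```

To follow the perceptual approach of this answer, the key could be `key_fn=lambda p: imagehash.average_hash(Image.open(p))`, assuming PIL and imagehash are imported as in the snippet above; the CNN would then be trained on `unique_paths` only.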

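Near-exact duplicates

In this case you have to set a threshold and compare the hashes by their distance from each other. The threshold has to be found by trial and error for your image content.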

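from itertools import combinations

from PIL import Image
import imagehash

# image_fns : List of training image files
# Maximum hash distance that still counts as a near duplicate.
# average_hash produces 64 bits by default, so distances range from 0 to 64;
# keep this small and tune it by trial and error on your own images.
epsilon = 5

# Hash every image once instead of once per comparison.
img_hashes = {fn: imagehash.average_hash(Image.open(fn)) for fn in image_fns}

# Compare every unordered pair of images; subtracting two hashes
# gives the number of bits in which they differ.
for image_fn1, image_fn2 in combinations(sorted(img_hashes), 2):
    if img_hashes[image_fn1] - img_hashes[image_fn2] < epsilon:
        print('{} is near duplicate of {}'.format(image_fn1, image_fn2))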
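
The comparison above prints pairs. If the goal is to keep one image per set of near duplicates, the pairs can be merged into groups. The sketch below does this with a small union-find; `group_near_duplicates` and `bit_distance` are hypothetical names, and the toy integer hashes stand in for real image hashes. The default `distance` assumes hash objects whose subtraction yields a bit difference, as in the snippet above. Grouping is transitive here, so a chain of small edits can end up in one group even when its two ends differ by more than `epsilon`.

```python
from itertools import combinations


def group_near_duplicates(hashes, epsilon, distance=lambda a, b: a - b):
    """Group the keys of `hashes` whose hash distance is below `epsilon`.

    hashes: dict mapping file name -> hash value.
    Returns a list of groups; each group is a sorted list of file names.
    """
    parent = {name: name for name in hashes}

    def find(name):
        # Follow parent links to the group's root, shortening the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for name_a, name_b in combinations(sorted(hashes), 2):
        if distance(hashes[name_a], hashes[name_b]) < epsilon:
            root_a, root_b = find(name_a), find(name_b)
            if root_a != root_b:
                # Keep the alphabetically smaller name as the root.
                parent[max(root_a, root_b)] = min(root_a, root_b)

    groups = {}
    for name in sorted(hashes):
        groups.setdefault(find(name), []).append(name)
    return list(groups.values())


def bit_distance(a, b):
    """Hamming distance between two integers treated as bit strings."""
    return bin(a ^ b).count("1")


# Hypothetical 8-bit hashes standing in for real image hashes.
toy_hashes = {
    "cat_1.jpg": 0b11110000,
    "cat_1_resized.jpg": 0b11110001,
    "cat_1_cropped.jpg": 0b11110011,
    "dog_1.jpg": 0b00001111,
}
groups = group_near_duplicates(toy_hashes, epsilon=2, distance=bit_distance)
keep = [group[0] for group in groups]  # one representative per group
print(groups)
print(keep)
```

Training on `keep` then uses a single image from each group of near duplicates.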
