Question

I have more than 600,000 images. I estimate that roughly 5-10% of them are corrupted. I am generating a log of exactly which images are affected.

Using Python, my approach so far is:


# Check number of partitions in rdd
print(data.rdd.getNumberPartitions())

# Coalesce to a single partition; coalesce() reduces the number of partitions.
data.rdd.coalesce(1).saveAsTextFile("./your_file")

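The first 200-250K images were processed quite quickly, in only about 1-2 hours. I then left the process running overnight (it was at 230K at the time), and 8 hours later it had only reached 310K, although it was still making progress.

Does anyone know why this happens? At first I thought it might be because the images are stored on an HDD, but that does not really make sense, since the first 200-250K were fast.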

Answer 1

If you have that many images, I suggest you use multiprocessing. I created 100,000 files, 5% of which were corrupt, and checked them like this:

#!/usr/bin/env python3

import glob
from multiprocessing import Pool
from PIL import Image

def CheckOne(f):
    """Return None if the image verifies, otherwise the name of the corrupt file."""
    try:
        # The context manager closes the file even if verify() raises
        with Image.open(f) as im:
            im.verify()
        # DEBUG: print(f"OK: {f}")
        return None
    except (IOError, OSError, Image.DecompressionBombError):
        # DEBUG: print(f"Fail: {f}")
        return f

if __name__ == '__main__':
    # Create a list of files to process
    files = glob.glob("*.jpg")

    print(f"Files to be checked: {len(files)}")

    # Create a pool of processes and map the list of files onto it
    with Pool() as p:
        result = p.map(CheckOne, files)

    # Filter out None values representing files that are ok, leaving just corrupt ones
    result = list(filter(None, result))
    print(f"Num corrupt files: {len(result)}")

Sample Output

Files to be checked: 100002
Num corrupt files: 5001

That takes 1.6 seconds on a 12-core CPU with an NVMe disk. Your hardware may be slower, but it should still be noticeably faster for you.
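Since the question asks for a log of exactly which images are corrupt, here is a rough sketch of one way to write that log while the pool is still working, rather than collecting everything in memory first. It is not benchmarked. The helper name `log_corrupt`, the log file name `corrupt_files.log` and the `chunksize` of 256 are all assumptions made for illustration, so adjust them to your setup.

```python
import glob
from multiprocessing import Pool


def check_one(path):
    """Return None if the image verifies, otherwise the path of the bad file."""
    # Pillow is imported inside the worker so that log_corrupt() below
    # can be reused on its own with a different checker.
    from PIL import Image
    try:
        with Image.open(path) as im:
            im.verify()
        return None
    except (OSError, Image.DecompressionBombError):
        return path


def log_corrupt(results, log_path):
    """Write each non-None result to log_path, one per line; return the count.

    Hypothetical helper: `results` is any iterable of None / file names.
    """
    count = 0
    with open(log_path, "w", encoding="utf-8") as log:
        for bad in results:
            if bad is not None:
                log.write(f"{bad}\n")
                count += 1
    return count


if __name__ == "__main__":
    # iglob is lazy, so the full 600K-entry list is never built up front
    files = glob.iglob("*.jpg")
    with Pool() as pool:
        # imap_unordered yields results as workers finish; chunksize batches
        # the paths to cut inter-process overhead (256 is an arbitrary guess)
        results = pool.imap_unordered(check_one, files, chunksize=256)
        # Consume the results inside the with-block, before the pool shuts down
        n_bad = log_corrupt(results, "corrupt_files.log")
    print(f"Num corrupt files: {n_bad}")
```

Because results arrive in completion order, the log is not sorted; sort it afterwards if the order matters.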
