I've put together some code (thanks, Stack Overflow users!) that uses imagehash to check for similarity between images, but now I'm running into problems checking thousands of images (about 16,000). Is there anything I can improve in the code (or a completely different route) that would find matches more accurately and/or reduce the time required? Thanks!

First, I changed the list it builds to itertools.combinations, so it only compares unique pairs of images.
import csv
import itertools
import os

import imagehash
from PIL import Image

os.chdir(r'myimagelocation')  # work inside the image folder so bare file names resolve
dirloc = os.listdir(r'myimagelocation')

duplicates = []
dup = []

for f1, f2 in itertools.combinations(dirloc, 2):
    # Honestly not sure which hash method to use, so I went with dhash.
    dhash1 = imagehash.dhash(Image.open(f1))
    dhash2 = imagehash.dhash(Image.open(f2))
    hashdif = dhash1 - dhash2
    if hashdif < 5:  # May change the 5 to find more accurate matches
        print("images are similar due to dhash", "image1", f1, "image2", f2)
        duplicates.append(f1)
        dup.append(f2)

        # Setting up a CSV file with the similar images to review before deleting
        with open("duplicates.csv", "w", newline="") as myfile:
            wr = csv.writer(myfile)
            wr.writerows(zip(duplicates, dup))
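As a side note on what `dhash1 - dhash2` computes: imagehash overloads subtraction to return the Hamming distance, i.e. the number of bit positions where the two hashes differ. A minimal stdlib sketch of the same idea, using made-up integer "hashes" rather than real imagehash objects:

```python
def hamming(a: int, b: int) -> int:
    # XOR leaves a 1 in every bit position where the two values differ;
    # counting those 1s gives the Hamming distance.
    return bin(a ^ b).count("1")

h1 = 0b1011_0110  # made-up 8-bit "hash"
h2 = 0b1011_0001  # differs from h1 in the three lowest bits

print(hamming(h1, h2))  # 3
print(hamming(h1, h1))  # 0, identical hashes
```

The `< 5` threshold in the code above is therefore "fewer than 5 differing bits"; raising it finds looser matches, lowering it stricter ones.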
Currently, this code could take days to process the number of images I have in that folder. I'd like to get that down to a few hours.
Answer 0 (score: 1)
Try this: instead of hashing each image during the comparison (127,992,000 hashes), hash them ahead of time and compare the hashes, since those hashes never change (16,000 hashes).
import csv
import itertools
import os

import imagehash
from PIL import Image

os.chdir(r'myimagelocation')
dirloc = os.listdir(r'myimagelocation')

duplicates = []
dup = []
hashes = []

# Hash every image exactly once up front; the hashes never change,
# so there is no need to recompute them for every pair.
# Honestly not sure which hash method to use, so I went with dhash.
for file in dirloc:
    hashes.append((file, imagehash.dhash(Image.open(file))))

for (f1, dhash1), (f2, dhash2) in itertools.combinations(hashes, 2):
    hashdif = dhash1 - dhash2
    if hashdif < 5:  # May change the 5 to find more accurate matches
        print("images are similar due to dhash", "image1", f1, "image2", f2)
        duplicates.append(f1)
        dup.append(f2)

# Setting up a CSV file with the similar images to review before deleting.
# Also moved out of the loop so you aren't rewriting the file every time.
with open("duplicates.csv", "w", newline="") as myfile:
    wr = csv.writer(myfile)
    wr.writerows(zip(duplicates, dup))
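For reference, the 127,992,000 figure above is the number of unique pairs among 16,000 files, i.e. C(16000, 2), which is exactly what `itertools.combinations(..., 2)` enumerates. A quick standard-library check (the file names here are hypothetical):

```python
import itertools
import math

files = [f"img_{i}.jpg" for i in range(16_000)]  # hypothetical file names

# Number of unique unordered pairs: C(16000, 2)
print(math.comb(len(files), 2))  # 127992000

# itertools.combinations yields exactly that many pairs (small sample check):
sample = list(itertools.combinations(files[:5], 2))
print(len(sample))  # C(5, 2) == 10
```

So pre-hashing cuts the expensive work (opening and hashing images) from twice per pair down to once per file; the pairwise loop still runs ~128 million times, but each iteration is now only a cheap integer comparison.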