Question

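I have some (38,000) image/video files in a folder. About 40% of them are duplicates, which I am trying to get rid of. My question is: how can I tell whether two files are identical? So far I have tried using the SHA1 of the files, but it turns out that many of the duplicate files have different hashes. This is the code I am using: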

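import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.math.BigInteger;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public static String getHash(File doc) {
    try {
        MessageDigest md = MessageDigest.getInstance("SHA1");
        // try-with-resources closes the whole stream chain, even on error
        try (InputStream in = new DigestInputStream(
                new BufferedInputStream(new FileInputStream(doc)), md)) {
            byte[] buffer = new byte[8192];
            while (in.read(buffer) != -1) {
                // nothing to do: reading through the stream updates the digest
            }
        }
        // Signum 1 keeps the value non-negative; pad to the full 40 hex digits
        return String.format("%040x", new BigInteger(1, md.digest()));
    } catch (NoSuchAlgorithmException | IOException e) {
        e.printStackTrace();
        return null;
    }
}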

Can I modify this in some way? Or do I have to use a different approach?

Answer 1

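As mentioned above, duplicate detection can be based on hashing. However, if you want near-duplicate detection, meaning you are looking for images that show essentially the same content but have been scaled, rotated, and so on, you may need a content-based image retrieval approach. There is LIRE (https://code.google.com/p/lire/), a Java library, and you can find the "SimpleApplication" in its download section. What you can then do is: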

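1. Index the first image.
2. Go to the next image I.
3. Search the index for I.
4. If a result has a score below a threshold, mark I as a duplicate.
5. Index I.
6. Go to (2).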

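My students did this and it worked well, but I don't have the source code at hand. Rest assured, though, that it is just a few lines, and the simple application will get you started.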
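To make the loop above concrete, here is a minimal sketch in plain Java. It is not LIRE's API: the class `NearDuplicateIndex`, the coarse RGB histogram used as the image feature, and the threshold value are all assumptions chosen for illustration. A real content-based retrieval library would use stronger features and a proper index.

```java
import java.awt.image.BufferedImage;
import java.util.ArrayList;
import java.util.List;

/** In-memory stand-in for an image index (hypothetical, not LIRE's API). */
final class NearDuplicateIndex {
    private static final int BINS = 4;                 // bins per color channel
    private final List<double[]> features = new ArrayList<>();
    private final double threshold;                    // assumed cut-off, tune on real data

    NearDuplicateIndex(double threshold) {
        this.threshold = threshold;
    }

    /** Normalized RGB histogram: a deliberately simple global feature. */
    static double[] histogram(BufferedImage img) {
        double[] h = new double[BINS * BINS * BINS];
        for (int y = 0; y < img.getHeight(); y++) {
            for (int x = 0; x < img.getWidth(); x++) {
                int rgb = img.getRGB(x, y);
                int r = ((rgb >> 16) & 0xff) * BINS / 256;
                int g = ((rgb >> 8) & 0xff) * BINS / 256;
                int b = (rgb & 0xff) * BINS / 256;
                h[(r * BINS + g) * BINS + b]++;
            }
        }
        double pixels = (double) img.getWidth() * img.getHeight();
        for (int i = 0; i < h.length; i++) {
            h[i] /= pixels;                            // makes the feature size-independent
        }
        return h;
    }

    /** L1 distance between two histograms, in the range 0..2. */
    static double distance(double[] a, double[] b) {
        double d = 0;
        for (int i = 0; i < a.length; i++) {
            d += Math.abs(a[i] - b[i]);
        }
        return d;
    }

    /** Steps 3 to 5: search the index for img, then index it. */
    boolean checkAndAdd(BufferedImage img) {
        double[] f = histogram(img);
        boolean duplicate = false;
        for (double[] known : features) {
            if (distance(f, known) < threshold) {
                duplicate = true;
                break;
            }
        }
        features.add(f);
        return duplicate;
    }
}
```

Calling `checkAndAdd` once per image, in any order, walks through the whole procedure; every call that returns `true` marks that image as a likely duplicate of something seen earlier. A color histogram ignores scaling and rotation but also confuses unrelated images with similar colors, so treat the result as a candidate list to review.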

Answer 2

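Besides using hashes, if your duplicates have different sizes (because they have been resized), you can compare them pixel by pixel (perhaps not the whole image, but a sub-section of it).

This may depend on the image format, but you can start by comparing height and width, and then compare pixel by pixel using the RGB values. To make it more efficient, you can decide on a comparison threshold. For example: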

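import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

public class Main {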
    public static void main(String[] args) throws IOException {
        ImageChecker i = new ImageChecker();
        BufferedImage one = ImageIO.read(new File("D:/Images/460249177.jpg"));
        BufferedImage two = ImageIO.read(new File("D:/Images/460249177a.jpg"));
        if(one.getWidth() + one.getHeight() >= two.getWidth() + two.getHeight()) {
            i.setOne(one);
            i.setTwo(two);
        } else {
            i.setOne(two);
            i.setTwo(one);
        }
        System.out.println(i.compareImages());
    }
}

public class ImageChecker {

    private BufferedImage one;
    private BufferedImage two;
    private double difference = 0;
    private int x = 0;
    private int y = 0;

    public ImageChecker() {

    }

    public boolean compareImages() {
        int f = 20;
        int w1 = Math.min(50, one.getWidth() - two.getWidth());
        int h1 = Math.min(50, one.getHeight() - two.getHeight());
        int w2 = Math.min(5, one.getWidth() - two.getWidth());
        int h2 = Math.min(5, one.getHeight() - two.getHeight());
        for (int i = 0; i <= one.getWidth() - two.getWidth(); i += f) {
            for (int j = 0; j <= one.getHeight() - two.getHeight(); j += f) {
                compareSubset(i, j, f);
            }
        }

        one = one.getSubimage(Math.max(0, x - w1), Math.max(0, y - h1),
                Math.min(two.getWidth() + w1, one.getWidth() - x + w1),
                Math.min(two.getHeight() + h1, one.getHeight() - y + h1));
        x = 0;
        y = 0;
        difference = 0;
        f = 5;
        for (int i = 0; i <= one.getWidth() - two.getWidth(); i += f) {
            for (int j = 0; j <= one.getHeight() - two.getHeight(); j += f) {
                compareSubset(i, j, f);
            }
        }
        one = one.getSubimage(Math.max(0, x - w2), Math.max(0, y - h2),
                Math.min(two.getWidth() + w2, one.getWidth() - x + w2),
                Math.min(two.getHeight() + h2, one.getHeight() - y + h2));
        f = 1;
        for (int i = 0; i <= one.getWidth() - two.getWidth(); i += f) {
            for (int j = 0; j <= one.getHeight() - two.getHeight(); j += f) {
                compareSubset(i, j, f);
            }
        }
        System.out.println(difference);
        return difference < 0.1;
    }

    public void compareSubset(int a, int b, int f) {
        double diff = 0;
        for (int i = 0; i < two.getWidth(); i += f) {
            for (int j = 0; j < two.getHeight(); j += f) {
                int onepx = one.getRGB(i + a, j + b);
                int twopx = two.getRGB(i, j);
                // Mask every channel so the alpha bits never leak into red
                int r1 = (onepx >> 16) & 0xff;
                int g1 = (onepx >> 8) & 0xff;
                int b1 = (onepx) & 0xff;
                int r2 = (twopx >> 16) & 0xff;
                int g2 = (twopx >> 8) & 0xff;
                int b2 = (twopx) & 0xff;
                diff += (Math.abs(r1 - r2) + Math.abs(g1 - g2) + Math.abs(b1
                        - b2)) / 3.0 / 255.0;
            }
        }
        double percentDiff = diff * f * f / (two.getWidth() * two.getHeight());
        if (percentDiff < difference || difference == 0) {
            difference = percentDiff;
            x = a;
            y = b;
        }
    }

    public BufferedImage getOne() {
        return one;
    }

    public void setOne(BufferedImage one) {
        this.one = one;
    }

    public BufferedImage getTwo() {
        return two;
    }

    public void setTwo(BufferedImage two) {
        this.two = two;
    }
}

Answer 3

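The best approach is to use the aHash, pHash, and dHash algorithms.

I wrote a pure Java library for this a few days ago. You can feed it a directory path (subdirectories included), and it will list the duplicate images, giving the absolute paths of the ones you would want to delete. Alternatively, you can use it to find all unique images in a directory.

It uses the AWT API internally, so it cannot be used on Android. Also, because ImageIO has trouble reading many newer image types, it uses the TwelveMonkeys jars internally.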

https://github.com/srch07/Duplicate-Image-Finder-API

A jar with its dependencies bundled inside can be downloaded from https://github.com/srch07/Duplicate-Image-Finder-API/blob/master/archives/duplicate_image_finder_1.0.jar

The API can also find duplicates among images of different sizes.
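For readers who want to see what such a perceptual hash looks like, here is a minimal sketch of aHash (average hash) using only the JDK. It is not taken from the library linked above; the class name `AverageHash` and the 8x8 grid size are assumptions for illustration, and the distance at which two images count as duplicates has to be tuned on your own data.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

/** Illustrative average hash (aHash); names and sizes are assumptions. */
final class AverageHash {
    private static final int SIZE = 8;   // an 8x8 grid gives a 64-bit hash

    /** Computes a 64-bit average hash of the image. */
    static long hash(BufferedImage img) {
        // Shrink to 8x8 grayscale so image size and fine detail stop mattering
        BufferedImage small = new BufferedImage(SIZE, SIZE, BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = small.createGraphics();
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                RenderingHints.VALUE_INTERPOLATION_BILINEAR);
        g.drawImage(img, 0, 0, SIZE, SIZE, null);
        g.dispose();

        int[] gray = new int[SIZE * SIZE];
        long sum = 0;
        for (int y = 0; y < SIZE; y++) {
            for (int x = 0; x < SIZE; x++) {
                int v = small.getRaster().getSample(x, y, 0);
                gray[y * SIZE + x] = v;
                sum += v;
            }
        }
        long mean = sum / gray.length;

        // One bit per cell: set when the cell is brighter than the mean
        long bits = 0L;
        for (int i = 0; i < gray.length; i++) {
            if (gray[i] > mean) {
                bits |= 1L << i;
            }
        }
        return bits;
    }

    /** Number of differing bits; small values suggest similar images. */
    static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }
}
```

Two files are then compared with `AverageHash.hammingDistance(AverageHash.hash(img1), AverageHash.hash(img2))`. Because the hash is a single `long`, you can compute it once per file and compare hashes instead of pixels, which matters with tens of thousands of images.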

Answer 4

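You can convert the files with ImageMagick's convert to a format that has a canonical representation and as little metadata as possible. I think I would use PNM. So try something like this: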

convert input.png pnm:- | md5sum -

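If this produces the same result for two files that previously compared as different, then metadata is in fact the source of your problem. You can either use a command-line approach like this one, or update your code to read the image and compute the hash from the raw uncompressed data.

If, on the other hand, the files still compare as different, then some change was made to the actual image data. One possible cause is the addition or removal of an alpha channel, particularly if you are dealing with PNG here. With JPEG, the image has probably been decompressed and then recompressed, which leads to slight modifications and data loss. JPEG is an inherently lossy codec, and any two images are likely to differ unless they were created with the same application (or library), the same settings, and the same input data. In that case you will need to perform fuzzy image matching. There are tools that can do this kind of thing. If you want to do it yourself, you are facing a lot of work and should do some research beforehand.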
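If you would rather stay in Java than shell out to ImageMagick, hashing the decoded pixels could look roughly like the sketch below. The class and method names (`PixelHash`, `hashPixels`, `hashFile`) are made up for this example, and it assumes `ImageIO` has a reader for your file types. It only helps in the first case described above, where the pixel data is identical and only the metadata or container differs.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import javax.imageio.ImageIO;

/** Illustrative helper: hashes decoded pixels instead of file bytes. */
final class PixelHash {
    /** SHA-1 over the decoded pixels, ignoring container metadata. */
    static String hashPixels(BufferedImage img) {
        MessageDigest md;
        try {
            md = MessageDigest.getInstance("SHA-1");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
        int w = img.getWidth();
        int h = img.getHeight();
        // Include the dimensions so a 2x8 and a 4x4 image cannot collide trivially
        md.update(ByteBuffer.allocate(8).putInt(w).putInt(h).array());
        int[] row = new int[w];
        ByteBuffer buf = ByteBuffer.allocate(w * 4);
        for (int y = 0; y < h; y++) {
            img.getRGB(0, y, w, 1, row, 0, w);   // one row of ARGB pixels
            buf.clear();
            buf.asIntBuffer().put(row);
            md.update(buf.array());
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** Decodes the file and hashes its pixels. */
    static String hashFile(File file) throws IOException {
        BufferedImage img = ImageIO.read(file);  // null if no reader understands the file
        if (img == null) {
            throw new IOException("Not a readable image: " + file);
        }
        return hashPixels(img);
    }
}
```

Decoding every file is much slower than hashing raw bytes, so a reasonable design is to hash the file bytes first and fall back to `hashFile` only for files that are left without a match.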

Answer 5

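It has been a long time, so I should explain how I ended up solving the problem. The real trick was not to use hashes to begin with, but simply to compare the timestamps in the EXIF data. Given that these photos were taken by my wife, it is unlikely that different files have the same timestamp, so this simpler solution is actually much more reliable.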
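The grouping step of this approach can be sketched as follows. Reading the EXIF timestamp itself is left out, because the JDK has no simple API for it and a metadata library would be needed; here it is represented by a caller-supplied function, `timestampOf`, which is a placeholder name and not a real API. The class name `TimestampGrouping` is likewise made up for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Illustrative helper: finds duplicate candidates by shared timestamp. */
final class TimestampGrouping {
    /**
     * Groups items by the timestamp returned by the extractor and keeps
     * only groups with more than one member (the duplicate candidates).
     */
    static <T> Map<String, List<T>> duplicatesByTimestamp(
            List<T> items, Function<T, String> timestampOf) {
        Map<String, List<T>> groups = new LinkedHashMap<>();
        for (T item : items) {
            String ts = timestampOf.apply(item);
            if (ts == null) {
                continue;   // no timestamp available: cannot decide, leave the file alone
            }
            groups.computeIfAbsent(ts, k -> new ArrayList<>()).add(item);
        }
        // Drop timestamps that occur only once; those files are unique
        groups.values().removeIf(group -> group.size() < 2);
        return groups;
    }
}
```

In each remaining group you would keep one file and treat the others as duplicates. This only works under the assumption stated above, namely that two different photos never share a timestamp, so burst shots or files without EXIF data need separate handling.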

Answer 6

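You can check the percentage difference between two images with the method below; if the percentage is less than 10, you can call them the same image: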

import java.awt.Graphics2D;
import java.awt.Image;
import java.awt.image.BufferedImage;

private static double getDifferencePercent(BufferedImage img1, BufferedImage img2) {
    int width = img1.getWidth();
    int height = img1.getHeight();
    int width2 = img2.getWidth();
    int height2 = img2.getHeight();
    if (width != width2 || height != height2) {
        throw new IllegalArgumentException(String.format("Images must have the same dimensions: (%d,%d) vs. (%d,%d)", width, height, width2, height2));
    }

    long diff = 0;
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            diff += pixelDiff(img1.getRGB(x, y), img2.getRGB(x, y));
        }
    }
    long maxDiff = 3L * 255 * width * height;

    return 100.0 * diff / maxDiff;
}

private static int pixelDiff(int rgb1, int rgb2) {
    int r1 = (rgb1 >> 16) & 0xff;
    int g1 = (rgb1 >>  8) & 0xff;
    int b1 =  rgb1        & 0xff;
    int r2 = (rgb2 >> 16) & 0xff;
    int g2 = (rgb2 >>  8) & 0xff;
    int b2 =  rgb2        & 0xff;
    return Math.abs(r1 - r2) + Math.abs(g1 - g2) + Math.abs(b1 - b2);
}
// Convert an Image to a BufferedImage with this method

public static BufferedImage toBufferedImage(Image img)
{
    if (img instanceof BufferedImage)
    {
        return (BufferedImage) img;
    }

    // Create a buffered image with transparency
    BufferedImage bimage = new BufferedImage(img.getWidth(null), img.getHeight(null), BufferedImage.TYPE_INT_ARGB);

    // Draw the image on to the buffered image
    Graphics2D bGr = bimage.createGraphics();
    bGr.drawImage(img, 0, 0, null);
    bGr.dispose();

    // Return the buffered image
    return bimage;
}

The idea came from this site: https://rosettacode.org/wiki/Percentage_difference_between_images#Kotlin

Answer 7

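This question was asked a long time ago. I found the following link very useful; it has code for all languages. https://rosettacode.org/wiki/Percentage_difference_between_images#Kotlin

Here is the Kotlin code taken from the link: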

import java.awt.image.BufferedImage
import java.io.File
import javax.imageio.ImageIO
import kotlin.math.abs

fun getDifferencePercent(img1: BufferedImage, img2: BufferedImage): Double {
    val width = img1.width
    val height = img1.height
    val width2 = img2.width
    val height2 = img2.height
    if (width != width2 || height != height2) {
        val f = "(%d,%d) vs. (%d,%d)".format(width, height, width2, height2)
        throw IllegalArgumentException("Images must have the same dimensions: $f")
    }
    var diff = 0L
    for (y in 0 until height) {
        for (x in 0 until width) {
            diff += pixelDiff(img1.getRGB(x, y), img2.getRGB(x, y))
        }
    }
    val maxDiff = 3L * 255 * width * height
    return 100.0 * diff / maxDiff
}

fun pixelDiff(rgb1: Int, rgb2: Int): Int {
    val r1 = (rgb1 shr 16) and 0xff
    val g1 = (rgb1 shr 8)  and 0xff
    val b1 =  rgb1         and 0xff
    val r2 = (rgb2 shr 16) and 0xff
    val g2 = (rgb2 shr 8)  and 0xff
    val b2 =  rgb2         and 0xff
    return abs(r1 - r2) + abs(g1 - g2) + abs(b1 - b2)
}

fun main(args: Array<String>) {
    val img1 = ImageIO.read(File("Lenna50.jpg"))
    val img2 = ImageIO.read(File("Lenna100.jpg"))

    val p = getDifferencePercent(img1, img2)
    println("The percentage difference is ${"%.6f".format(p)}%")
}

Comparing images to find duplicates

7 answers: