Question

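I am writing a program that checks files by comparing each file's MD5 against a database of MD5s that have already been checked.

It loops through thousands of files, and I can see that it uses a lot of memory.

How can I make my code as efficient as possible?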

    for (File f : directory.listFiles()) {
        String MD5;
        //Check if the Imagefile instance is an image. If so, check if it's already in the pMap.
        if (Utils.isImage(f)) {
            MD5 = Utils.toMD5(f);
            if (!SyncFolderMapImpl.MD5Map.containsKey(MD5)) {
                System.out.println("Adding " + f.getName() + " to DB");
                add(new PhotoDTO(f.getPath(), MD5, albumName));
            }
        }
    }

This is the toMD5 method:

public static String toMD5(File file) throws IOException, NoSuchAlgorithmException {
    MessageDigest md = MessageDigest.getInstance("MD5");
    FileInputStream fis = new FileInputStream(file.getPath());


    byte[] dataBytes = new byte[8192];

    int nread = 0;
    while ((nread = fis.read(dataBytes)) != -1) {
        md.update(dataBytes, 0, nread);
    }

    byte[] mdbytes = md.digest();

    //convert the byte to hex format method 2
    StringBuffer hexString = new StringBuffer();
    for (int i = 0; i < mdbytes.length; i++) {
        String hex = Integer.toHexString(0xff & mdbytes[i]);
        if (hex.length() == 1) hexString.append('0');
        hexString.append(hex);
    }
    return hexString.toString();
}

Edit: I tried using FastMD5. Same result.

public static String toMD5(File file) throws IOException, NoSuchAlgorithmException {

    return MD5.asHex(MD5.getHash(file));
}

Edit 2: I tried using ThreadLocal and BufferedInputStream. I still see a lot of memory usage.

private static ThreadLocal<MessageDigest> md = new ThreadLocal<MessageDigest>(){
     protected MessageDigest initialValue() {
         try {
             return MessageDigest.getInstance("MD5");
         } catch (NoSuchAlgorithmException e) {
             e.printStackTrace();  //To change body of catch statement use File | Settings | File Templates.
         }
         System.out.println("Fail");
         return null;

     }
};


private static ThreadLocal<byte[]> dataBytes = new ThreadLocal<byte[]>(){

    protected byte[] initialValue(){
     return new byte[1024];
    }

};

public static String toMD5(File file) throws IOException, NoSuchAlgorithmException {

    //        MessageDigest mds = md.get();
    BufferedInputStream fis = new BufferedInputStream(new FileInputStream(file));


    //        byte[] dataBytes = new byte[1024];

    int nread = 0;
    while ((nread = fis.read(dataBytes.get())) != -1) {
        md.get().update(dataBytes.get(), 0, nread);
    }

    byte[] mdbytes = md.get().digest();

    //convert the byte to hex format method 2
    StringBuffer hexString = new StringBuffer();
    fis.close();
    System.gc();
    return javax.xml.bind.DatatypeConverter.printHexBinary(mdbytes).toLowerCase();




     //        return MD5.asHex(MD5.getHash(file));
}

Answer 1

How can I make my code as efficient as possible?

In two words: profile it!

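Get your code working, then profile it while it runs on a typical set of input files. Use the results to find out where the performance hotspots are.

If I were doing this, I would start with a single-threaded version and tune it for that case. Then I would slowly increase the number of threads to see how performance scales. Once you have found the "sweet spot", profile again and see where the bottlenecks are now.

It is actually hard to predict where the performance bottlenecks will be. It depends on the average file size, the number of cores, the speed of your disks, and the amount of memory the operating system has available for read-ahead buffering. It also depends on which operating system you are using.

My gut feeling is that the number of threads is going to be rather important. Too few, and the CPU sits idle waiting for the I/O system to fetch data from disk. Too many, and you use extra resources (such as memory for thread stacks) with no real benefit. An application like this is likely to be I/O bound, and a large number of threads will not fix that.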

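You commented as follows:

The performance issue is purely memory. I'm pretty sure there is a problem with the way I create the MD5 hash, so that it wastes memory.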

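I can see nothing in the code you have provided that would use a lot of memory. There is nothing grossly wrong with the way you are generating the hashes. AFAICT, the only ways that code could lead to memory usage problems are:

- you have many threads all executing that code, or
- you are keeping many hashes (and other things) in memory. (You have not told us what add is doing.)

My advice is similar, though: use a memory profiler and diagnose this as if it were a storage leak, which in a sense it is!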
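Before reaching for a full memory profiler, a rough first check is to log heap usage while the loop runs. This is only a sketch: HeapProbe and its method names are made up for illustration, and the numbers from Runtime are approximate because they depend on when the GC last ran.

```java
final class HeapProbe {
    private HeapProbe() {}

    /** Approximate bytes of heap currently in use (allocated minus free). */
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Builds a one-line report, e.g. for printing every few hundred files. */
    static String report(int filesProcessed) {
        long usedMb = usedHeapBytes() / (1024 * 1024);
        long maxMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        return filesProcessed + " files, heap used ~" + usedMb + " MB of max " + maxMb + " MB";
    }
}
```

Printing HeapProbe.report(count) every 200 files or so shows whether usage keeps climbing or drops back after collections. If it keeps climbing, a heap dump opened in a tool such as VisualVM would show which objects are being retained.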

Answer 2

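A quick look at your code shows three things:

- You don't need to create a new MessageDigest every time the toMD5 method is called. One per thread is enough.
- You don't need to create a new byte[] buffer every time the toMD5 method is called. One per thread is enough.
- You may want to use javax.xml.bind.DatatypeConverter.printHexBinary(byte[]) for the hex conversion. It is faster.

You can use a ThreadLocal for each of the first two points.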
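A minimal sketch of those points, assuming Java 8 or later. The class name Md5PerThread and the helper toHex are hypothetical. Because the javax.xml.bind module was removed from the JDK in Java 11, the sketch uses a small lookup table for the hex conversion instead of DatatypeConverter; that is a substitution, not what the answer proposed.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class Md5PerThread {
    private static final char[] HEX = "0123456789abcdef".toCharArray();

    // One MessageDigest per thread; MessageDigest instances are not thread-safe.
    private static final ThreadLocal<MessageDigest> DIGEST = ThreadLocal.withInitial(() -> {
        try {
            return MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    });

    // One reusable read buffer per thread.
    private static final ThreadLocal<byte[]> BUFFER = ThreadLocal.withInitial(() -> new byte[8192]);

    private Md5PerThread() {}

    static String toMD5(Path file) throws IOException {
        MessageDigest md = DIGEST.get();
        md.reset(); // discard state left over if an earlier call failed mid-file
        byte[] buf = BUFFER.get();
        // try-with-resources closes the stream even when read() throws
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        return toHex(md.digest()); // digest() also resets the instance for reuse
    }

    /** Lowercase hex encoding via a lookup table. */
    static String toHex(byte[] bytes) {
        char[] out = new char[bytes.length * 2];
        for (int i = 0; i < bytes.length; i++) {
            int v = bytes[i] & 0xff;
            out[2 * i] = HEX[v >>> 4];
            out[2 * i + 1] = HEX[v & 0x0f];
        }
        return new String(out);
    }
}
```

Note that this saves only a small allocation per file. On its own it is unlikely to explain or cure heavy memory use.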

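Any further optimization will probably have to come from concurrency. Have one thread read the file contents and dispatch those byte[] arrays to other threads that actually compute the MD5 checksums.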
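One way that idea could be sketched is below. The class ReaderHasherPipeline and the parameters hashThreads and maxPending are hypothetical names. The sketch assumes each file fits comfortably in memory, and it uses a bounded queue so that file contents waiting to be hashed cannot pile up without limit.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class ReaderHasherPipeline {
    private ReaderHasherPipeline() {}

    static Map<Path, String> hashAll(List<Path> files, int hashThreads, int maxPending)
            throws IOException, InterruptedException, ExecutionException {
        // Bounded queue plus CallerRunsPolicy: when the queue is full, the
        // reading thread hashes the file itself, which throttles further reads.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                hashThreads, hashThreads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(maxPending),
                new ThreadPoolExecutor.CallerRunsPolicy());
        try {
            Map<Path, Future<String>> pending = new LinkedHashMap<>();
            for (Path file : files) {
                byte[] content = Files.readAllBytes(file); // all disk reads happen on the calling thread
                pending.put(file, pool.submit(() -> md5Hex(content)));
            }
            Map<Path, String> result = new LinkedHashMap<>();
            for (Map.Entry<Path, Future<String>> e : pending.entrySet()) {
                result.put(e.getKey(), e.getValue().get()); // wait for each hash
            }
            return result;
        } finally {
            pool.shutdown();
        }
    }

    static String md5Hex(byte[] data) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("MD5").digest(data);
        StringBuilder sb = new StringBuilder(digest.length * 2);
        for (byte b : digest) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}
```

Whether this helps depends on where the time goes. If the program is I/O bound, as suggested in another answer, the hashing threads will mostly sit idle, so it is worth measuring before adopting it.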

Answer 3

Use a larger buffer, at least 8192 bytes, or interpose a BufferedInputStream.
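A short sketch of the BufferedInputStream variant, with an explicit buffer size. BufferedMd5 is a hypothetical name, and the 64 KB figure is an assumption to be tuned by measurement, not a recommendation from this answer.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class BufferedMd5 {
    private static final int BUFFER_SIZE = 64 * 1024; // assumed value; measure before settling on it

    private BufferedMd5() {}

    static byte[] md5(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        // The BufferedInputStream pulls BUFFER_SIZE bytes from the file at a time,
        // so the small reads below are served from memory.
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file), BUFFER_SIZE)) {
            byte[] chunk = new byte[1024];
            int n;
            while ((n = in.read(chunk)) != -1) {
                md.update(chunk, 0, n);
            }
        }
        return md.digest();
    }
}
```

With a read array of 8192 bytes or more, the extra buffering layer adds little. The two options in the answer are alternatives rather than something to stack.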

Answer 4

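Thanks for the help, everyone. The problem was that the volume of data passing through was so high that the GC could not keep up. The proof-of-concept solution was to add a Thread.sleep(1000) after every 200 photos. A full solution would be to tune the GC more aggressively and to compute the MD5s in batches.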
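The batching part could look roughly like the sketch below. Batches and forEachBatch are hypothetical names, and the batch size of 200 in the usage note simply mirrors the number above. What happens inside each batch (hashing, writing to the database, pausing) is left to the caller.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class Batches {
    private Batches() {}

    /** Calls handler once per consecutive chunk of at most batchSize items. */
    static <T> void forEachBatch(List<T> items, int batchSize, Consumer<List<T>> handler) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            // Copy the slice so the handler receives an independent list.
            handler.accept(new ArrayList<>(items.subList(start, end)));
        }
    }
}
```

A call such as Batches.forEachBatch(files, 200, batch -> process(batch)), where process is a placeholder for the hashing and database work, keeps per-batch objects short-lived, provided nothing in process holds on to them afterwards.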

Performance issues when looping an MD5 calculator over many files
