Question

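I am given a list of files that all have the same size, and I need to return a list containing all groups of files that have identical content.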

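My idea is to first hash the files into a map whose key is the MD5 hash value and whose value is the list of paths that produce that hash. Here is the code of the hashing() function: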

public static Map<String, List<String>> hashing(List<File> list) throws Exception {
    Map<String, List<String>> map = new HashMap<>();
    for (File f : list) {
        String path = f.getAbsolutePath();
        FileInputStream in = new FileInputStream(path);
        byte[] dataBytes = new byte[1024];

        MessageDigest md = MessageDigest.getInstance("MD5");
        int n = 0;
        while ((n = in.read(dataBytes)) != -1) {
            md.update(dataBytes, 0, n);
        }
        byte[] mdBytes = md.digest();

        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < mdBytes.length; i++) {
            sb.append(Integer.toString((mdBytes[i] & 0xff) + 0x100, 16).substring(1));
        }
        String hash = sb.toString();
        if (!map.containsKey(hash)) {
            map.put(hash, new ArrayList<>());
        }
        map.get(hash).add(path);
    }
    return map;
}

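Since two different files can hash to the same value, I want to compare the files that share a hash value to verify that they really are identical. This is the checkSame() function: the input List<String> is the list of file paths with the same hash value, and List<List<String>> is the list of lists that collects all files with the same content.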

public static void checkSame(List<String> list, List<List<String>> result) throws Exception{
    List<String> temp = new ArrayList<>();
    for (int i = 1; i < list.size(); i++) {
        if (checkContent(list.get(0), list.get(i))) {
            continue;
        }
        list.remove(list.get(i));
        temp.add(list.get(i));
    }
    if (list.size() > 1) {
        result.add(list);
    }
    if (temp.size() > 1) {
        checkSame(temp, result);
    }
}

public static boolean checkContent (String path1, String path2) throws Exception {
    FileInputStream fis1 = new FileInputStream(path1);
    FileInputStream fis2 = new FileInputStream(path2);
    BufferedReader input1 = new BufferedReader(new InputStreamReader(fis1));
    BufferedReader input2 = new BufferedReader(new InputStreamReader(fis2));
    StringBuilder sb1 = new StringBuilder();
    StringBuilder sb2 = new StringBuilder();
    String line1, line2;
    try {
        while ((line1 = input1.readLine()) != null && (line2 = input2.readLine()) != null) {
            sb1.append(line1);
            sb2.append(line2);
            if (!sb1.toString().equals(sb2.toString())) {
                return false;
            }
        }
    } catch(Exception e) {
        e.printStackTrace();
    } finally {
        if (fis1 != null) {
            fis1.close();
        }
        if (fis2 != null) {
            fis2.close();
        }
    }
    return true;
}

My questions are:

Is there anything wrong with the code above?
Is there a more efficient way to solve this problem?

Answer 1

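checkSame does not use hashing.
By comparing c1 != c2 you compare the number of bytes read in each iteration, which is meaningless.
checkSame only compares the first file in the list with all the others; it does not compare the second file with all the others, the third with the rest, and so on.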
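For the third point, one possible way to make sure every file is compared against the right group is to keep one representative per group and test each new file against those representatives. The following is only a minimal sketch under that assumption; the names ContentGrouper, sameContent and groupIdentical are hypothetical and do not appear in the question. It compares raw bytes rather than text lines, so it does not depend on character encoding or line terminators.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of the original question.
final class ContentGrouper {

    // Returns true only if both files contain exactly the same bytes.
    static boolean sameContent(Path a, Path b) throws IOException {
        if (Files.size(a) != Files.size(b)) {
            return false; // different lengths can never be identical
        }
        try (InputStream in1 = new BufferedInputStream(Files.newInputStream(a));
             InputStream in2 = new BufferedInputStream(Files.newInputStream(b))) {
            int b1;
            while ((b1 = in1.read()) != -1) {
                if (b1 != in2.read()) {
                    return false; // first differing byte ends the comparison
                }
            }
            return in2.read() == -1; // both streams must end together
        }
    }

    // Partitions the candidates into groups of identical content and
    // keeps only the groups that contain at least two files.
    static List<List<Path>> groupIdentical(List<Path> candidates) throws IOException {
        List<List<Path>> groups = new ArrayList<>();
        for (Path p : candidates) {
            boolean placed = false;
            for (List<Path> group : groups) {
                // Comparing against the first member is enough, because all
                // members of a group are already known to be identical.
                if (sameContent(group.get(0), p)) {
                    group.add(p);
                    placed = true;
                    break;
                }
            }
            if (!placed) {
                List<Path> group = new ArrayList<>();
                group.add(p);
                groups.add(group);
            }
        }
        groups.removeIf(group -> group.size() < 2);
        return groups;
    }
}
```

Because the input list is never modified while it is being iterated, this avoids the index problems that come from removing elements inside the loop.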

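I would modify your approach slightly: I would compare by two parameters, the hash and the file size. I would create a new class called MyFile that holds three fields: String name, String hash, long size.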

Then I would iterate over the files and use the logic you introduced in hashing to create a list of MyFile objects. The size is easy to obtain:

private static long fileSize(String filename) {
    File file = new File(filename);
    return file.length();
}
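The hash string for each MyFile could come from a small helper along the following lines. This is a sketch, not the asker's code: FileDigest and md5Hex are hypothetical names, and the 8192-byte buffer is an arbitrary choice. It reuses the MessageDigest loop from hashing but opens the stream in a try-with-resources block so that the file is always closed.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper class, not part of the original answer.
final class FileDigest {

    // Returns the MD5 digest of the file as a lowercase hex string.
    static String md5Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] buffer = new byte[8192]; // buffer size is an assumption
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                md.update(buffer, 0, n); // feed only the bytes actually read
            }
        }
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest()) {
            sb.append(String.format("%02x", b & 0xff)); // two hex digits per byte
        }
        return sb.toString();
    }
}
```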

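Now have MyFile override two methods: hashCode() (which returns an integer derived from the hash you computed in the previous step) and equals, which checks only the hash and the size. For example: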

class MyFile {

    String name;
    String hash;
    long size;

    public MyFile(String name, String hash, long size) {
        this.name = name;
        this.hash = hash;
        this.size = size;
    }

    @Override
    public boolean equals(Object other) {
        if (!(other instanceof MyFile)) {
            return false;
        }
        MyFile o = (MyFile)other;
        return o.hash.equals(this.hash) && o.size == this.size;
    }

    @Override
    public int hashCode(){
        return hash.hashCode();
    }
}

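Now you can easily compare files and see whether they have the same hash and size, and therefore almost certainly the same content, even if they are stored under different paths.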
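Since equals and hashCode ignore the name, a MyFile can serve directly as a map key for grouping. The sketch below illustrates that idea under a few assumptions: DuplicateCandidates and group are hypothetical names, and a compact copy of MyFile is nested inside only so that the block compiles on its own. Files that land in the same group share hash and size; if MD5 collisions are a concern, a byte-wise comparison would still be needed to confirm them.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, not part of the original answer.
final class DuplicateCandidates {

    // Compact copy of MyFile so that this sketch is self-contained.
    static final class MyFile {
        final String name;
        final String hash;
        final long size;

        MyFile(String name, String hash, long size) {
            this.name = name;
            this.hash = hash;
            this.size = size;
        }

        @Override
        public boolean equals(Object other) {
            if (!(other instanceof MyFile)) {
                return false;
            }
            MyFile o = (MyFile) other;
            return o.hash.equals(this.hash) && o.size == this.size;
        }

        @Override
        public int hashCode() {
            return hash.hashCode();
        }
    }

    // Groups file names by (hash, size) and keeps groups with two or more names.
    static List<List<String>> group(List<MyFile> files) {
        // LinkedHashMap keeps the groups in first-seen order.
        Map<MyFile, List<String>> byKey = new LinkedHashMap<>();
        for (MyFile f : files) {
            byKey.computeIfAbsent(f, k -> new ArrayList<>()).add(f.name);
        }
        List<List<String>> result = new ArrayList<>();
        for (List<String> names : byKey.values()) {
            if (names.size() > 1) {
                result.add(names);
            }
        }
        return result;
    }
}
```

Each file is hashed once and inserted into the map once, so the grouping step itself is linear in the number of files.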
