Question

I have a stream of files, and a method that takes two files as arguments and returns whether they have the same content.

I want to reduce this stream of files to a set (or map) of sets, grouping together all the files with identical content.

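I know this is possible by refactoring the compare method to take a single file and return a hash, and then grouping the stream by the hash returned by the function given to the collector. But what is the cleanest way to achieve this with a compare method that takes two files and returns a boolean?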

For clarity, here is an example of the obvious approach with a one-argument function:

files.stream().collect(groupingBy(f -> Utility.getHash(f)));

But in my case I have the following method, which I would like to use in the partitioning process:

public boolean isFileSame(File f, File f2) {
    return Files.equal(f, f2);
}

Answer 1

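If all you have is a BiPredicate, with no associated hash function that would allow an efficient lookup, you can only use linear probing. There is no built-in collector that does this, but a custom collector that works much like the original groupingBy collector can be implemented like this: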

public static <T> Collector<T,?,Map<T,Set<T>>> groupingBy(BiPredicate<T,T> p) {
    return Collector.of(HashMap::new,
        (map,t) -> {
            for(Map.Entry<T,Set<T>> e: map.entrySet())
                if(p.test(t, e.getKey())) {
                    e.getValue().add(t);
                    return;
                }
            map.computeIfAbsent(t, x->new HashSet<>()).add(t);
        }, (m1,m2) -> {
            if(m1.isEmpty()) return m2;
            m2.forEach((t,set) -> {
                for(Map.Entry<T,Set<T>> e: m1.entrySet())
                    if(p.test(t, e.getKey())) {
                        e.getValue().addAll(set);
                        return;
                    }
                m1.put(t, set);
            });
            return m1;
        }
    );
}

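But, of course, the more groups you have, the worse the performance becomes, because every new element is tested against the key of each existing group.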
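
To make the behavior concrete, here is a minimal, self-contained sketch of the same linear-probing idea applied to strings instead of files. The class name BiPredicateGroupingDemo and the helper groupIgnoringCase are made up for illustration, and the lambda parameter types are spelled out only to keep type inference simple; treat it as a sketch, not as the answer's exact code.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.BiPredicate;
import java.util.stream.Collector;

class BiPredicateGroupingDemo {

    // Linear probing: each group is keyed by the first element that opened it,
    // and every incoming element is tested against those keys one by one.
    static <T> Collector<T, ?, Map<T, Set<T>>> groupingBy(BiPredicate<T, T> p) {
        return Collector.of(
            HashMap::new,
            (Map<T, Set<T>> map, T t) -> {
                for (Map.Entry<T, Set<T>> e : map.entrySet()) {
                    if (p.test(t, e.getKey())) {
                        e.getValue().add(t);
                        return;
                    }
                }
                // No existing group matched, so t starts a new one.
                Set<T> group = new HashSet<>();
                group.add(t);
                map.put(t, group);
            },
            (Map<T, Set<T>> m1, Map<T, Set<T>> m2) -> {
                // Merge each group of m2 into a matching group of m1, if any.
                m2.forEach((t, set) -> {
                    for (Map.Entry<T, Set<T>> e : m1.entrySet()) {
                        if (p.test(t, e.getKey())) {
                            e.getValue().addAll(set);
                            return;
                        }
                    }
                    m1.put(t, set);
                });
                return m1;
            });
    }

    // Hypothetical helper: groups words that are equal when case is ignored.
    static Set<Set<String>> groupIgnoringCase(List<String> words) {
        BiPredicate<String, String> sameIgnoringCase = String::equalsIgnoreCase;
        Map<String, Set<String>> byKey = words.stream().collect(groupingBy(sameIgnoringCase));
        return new HashSet<>(byKey.values());
    }

    public static void main(String[] args) {
        // Prints three groups; the iteration order of the sets is unspecified.
        System.out.println(groupIgnoringCase(Arrays.asList("a", "B", "A", "b", "c")));
    }
}
```

Note that the predicate is assumed to be an equivalence relation (reflexive, symmetric, transitive); otherwise the resulting groups depend on encounter order.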

For your specific task, it would be much more efficient to use

public static ByteBuffer readUnchecked(Path p) {
    try {
        return ByteBuffer.wrap(Files.readAllBytes(p));
    } catch(IOException ex) {
        throw new UncheckedIOException(ex);
    }
}

and

Set<Set<Path>> groupsByContents = paths.stream() // "paths" stands for your own collection of Path instances
    .collect(Collectors.collectingAndThen(
        Collectors.groupingBy(YourClass::readUnchecked, Collectors.toSet()),
        map -> new HashSet<>(map.values())));

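This groups the files by content and does the hashing implicitly, because ByteBuffer implements hashCode and equals based on its contents. Keep in mind that equal hashes do not imply equal contents, but this solution already takes care of that, since the map falls back to equals when hashes collide. The finishing function map -> new HashSet<>(map.values()) ensures that the resulting collection does not keep the files' contents in memory after the operation.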
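
As a rough end-to-end illustration of the approach above, the following sketch writes a few temporary files and groups them by content. The class name GroupByContentsDemo and the helpers tempFile and groupByContents are invented for this example; only readUnchecked and the collector chain come from the answer.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class GroupByContentsDemo {

    // Reads the whole file; ByteBuffer supplies content-based hashCode/equals.
    static ByteBuffer readUnchecked(Path p) {
        try {
            return ByteBuffer.wrap(Files.readAllBytes(p));
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    // Groups paths whose files hold identical bytes, then drops the map keys
    // so the file contents are not retained by the result.
    static Set<Set<Path>> groupByContents(List<Path> paths) {
        return paths.stream().collect(Collectors.collectingAndThen(
            Collectors.groupingBy(GroupByContentsDemo::readUnchecked, Collectors.toSet()),
            map -> new HashSet<>(map.values())));
    }

    // Hypothetical helper for the demo: creates a temp file with the given text.
    static Path tempFile(String content) {
        try {
            Path p = Files.createTempFile("group-demo", ".txt");
            Files.write(p, content.getBytes(StandardCharsets.UTF_8));
            p.toFile().deleteOnExit();
            return p;
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) {
        Path a = tempFile("same");
        Path b = tempFile("same");
        Path c = tempFile("different");
        // Two groups are expected: one holding a and b, one holding c.
        System.out.println(groupByContents(Arrays.asList(a, b, c)));
    }
}
```

Because every file is read fully into memory while grouping, this suits moderately sized files; for very large files a digest-based key may be a better trade-off.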

Answer 2

A possible solution using a helper class Wrapper:

files.stream()
    .collect(groupingBy(f -> Wrapper.of(f, Utility::getHash, (a, b) -> isFileSame(a, b)))) // isFileSame is the two-argument comparison from the question
    .keySet().stream().map(Wrapper::value).collect(toList());

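If for some reason you don't want to use Utility.getHash, try using File.length() as the hash function. Wrapper provides a general solution for customizing the hash/equals functions of any type (for example, arrays). It is a useful thing to keep in your toolbox. Here is a sample implementation of Wrapper: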

public class Wrapper<T> {
    private final T value;
    private final ToIntFunction<? super T> hashFunction;
    private final BiFunction<? super T, ? super T, Boolean> equalsFunction;
    private int hashCode;

    private Wrapper(T value, ToIntFunction<? super T> hashFunction, BiFunction<? super T, ? super T, Boolean> equalsFunction) {
        this.value = value;
        this.hashFunction = hashFunction;
        this.equalsFunction = equalsFunction;
    }
    public static <T> Wrapper<T> of(T value, ToIntFunction<? super T> hashFunction, BiFunction<? super T, ? super T, Boolean> equalsFunction) {
        return new Wrapper<>(value, hashFunction, equalsFunction);
    }
    public T value() {
        return value;
    }
    @Override
    public int hashCode() {
        if (hashCode == 0) {
            hashCode = value == null ? 0 : hashFunction.applyAsInt(value);
        }

        return hashCode;
    }
    @Override
    public boolean equals(Object obj) {
        return (obj == this) || (obj instanceof Wrapper && equalsFunction.apply(((Wrapper<T>) obj).value, value));
    }
    // TODO ...
}
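
To illustrate the "cheap hash plus exact equality" idea on arrays, here is a small sketch. It uses a condensed stand-in for Wrapper, named Key here purely for illustration, with the array length as a deliberately coarse hash (analogous to the File.length() suggestion) and Arrays.equals as the equality function. The names WrapperGroupingDemo, Key and groupArrays are assumptions of this example.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.function.BiPredicate;
import java.util.function.ToIntFunction;
import java.util.stream.Collectors;

class WrapperGroupingDemo {

    // Condensed stand-in for the Wrapper class: delegates hashCode/equals
    // to the functions supplied by the caller.
    static final class Key<T> {
        final T value;
        final ToIntFunction<? super T> hash;
        final BiPredicate<? super T, ? super T> eq;

        Key(T value, ToIntFunction<? super T> hash, BiPredicate<? super T, ? super T> eq) {
            this.value = value;
            this.hash = hash;
            this.eq = eq;
        }

        @Override
        public int hashCode() {
            return hash.applyAsInt(value);
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Key)) return false;
            @SuppressWarnings("unchecked")
            Key<T> other = (Key<T>) o;
            return eq.test(value, other.value);
        }
    }

    // Groups arrays with equal contents. The length is only a coarse hash;
    // Arrays.equals decides whether two arrays really belong together.
    static Collection<List<int[]>> groupArrays(List<int[]> arrays) {
        Map<Key<int[]>, List<int[]>> byContent = arrays.stream()
            .collect(Collectors.groupingBy(a -> new Key<int[]>(a, x -> x.length, Arrays::equals)));
        return byContent.values();
    }

    public static void main(String[] args) {
        Collection<List<int[]>> groups = groupArrays(Arrays.asList(
            new int[] {1, 2}, new int[] {3, 4}, new int[] {1, 2}, new int[] {5}));
        for (List<int[]> g : groups) {
            System.out.println(g.size() + " x " + Arrays.toString(g.get(0)));
        }
    }
}
```

Taking the map's values() rather than its keySet() yields the full groups instead of one representative per group.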

Java 8, how to group stream elements into sets using a BiPredicate
