Question

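I am trying to find the intersection of three hash sets of different sizes. Does the order in which the sets are intersected make a difference to how quickly the intersection is found? An example program is as follows: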

import java.util.HashSet;
import java.util.Random;
import java.util.Set;

public class RetainTest {
    static Set<Integer> large =new HashSet<>();
    static Set<Integer> medium =new HashSet<>();
    static Set<Integer> small =new HashSet<>();

    static int largeSize=10000;
    static int midSize=5000;
    static int smallSize=1000;      

    public static void main(String[] args){
        preamble();
        large.retainAll(medium);
        large.retainAll(small);

        System.out.println(large.size());
    }


    public static void preamble(){
        large =new HashSet<>();
        medium =new HashSet<>();
        small =new HashSet<>();

        Random rnd=new Random(15);
        for(int i=0;i<largeSize;i++){
            large.add(rnd.nextInt(largeSize*10));
        }

        for(int i=0;i<midSize;i++){
            medium.add(rnd.nextInt(largeSize*10));
        }
        for(int i=0;i<smallSize;i++){
            small.add(rnd.nextInt(largeSize*10));
        }

    }

}

Answer 1

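Profiling shows that the fastest way to combine several sets is to retainAll the larger sets into the smaller set. In addition, those retains should themselves be ordered from smallest to largest. So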

    small.retainAll(medium);
    small.retainAll(large);

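Profiling shows that the difference is substantial: for this data set, the slowest order took about 10 times as long as the fastest order.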
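
The smallest-to-largest rule generalizes to any number of sets. The following is a minimal sketch, not part of the original answer: the class `OrderedRetain` and the method `intersectSmallestFirst` are hypothetical names chosen for illustration. It sorts the sets by size and retains into the smallest one, so it mutates that set just as the two calls above mutate `small`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class OrderedRetain {

    // Hypothetical helper: intersects the given sets in place, retaining
    // into the smallest set and visiting the others from smallest to
    // largest. Returns the (mutated) smallest set.
    public static <T> Set<T> intersectSmallestFirst(List<Set<T>> sets) {
        if (sets.isEmpty()) {
            throw new IllegalArgumentException("need at least one set");
        }
        // Sort a copy of the list so the caller's list order is untouched.
        List<Set<T>> bySize = new ArrayList<>(sets);
        bySize.sort((a, b) -> Integer.compare(a.size(), b.size()));

        Set<T> result = bySize.get(0);
        for (Set<T> other : bySize.subList(1, bySize.size())) {
            result.retainAll(other);
        }
        return result;
    }

    public static void main(String[] args) {
        Set<Integer> large = new HashSet<>(Arrays.asList(1, 2, 3, 4, 5, 6));
        Set<Integer> medium = new HashSet<>(Arrays.asList(2, 3, 4, 7));
        Set<Integer> small = new HashSet<>(Arrays.asList(3, 4, 8));

        Set<Integer> common = intersectSmallestFirst(Arrays.asList(large, medium, small));
        System.out.println(common.size()); // prints 2 (the elements 3 and 4)
    }
}
```

If the inputs must survive, pass copies instead; copying only the smallest set keeps the extra cost low.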

Test program

These results were produced with the following test program, which was run for 20 minutes:

import java.util.HashSet;
import java.util.Random;
import java.util.Set;

public class RetainTest {

    static Set<Integer> large =new HashSet<>();
    static Set<Integer> medium =new HashSet<>();
    static Set<Integer> small =new HashSet<>();

    static int largeSize=10000;
    static int midSize=5000;
    static int smallSize=1000;      

    public static void main(String[] args){
        while(true){
            preamble();
            int size1=largeMediumSmall().size();
            preamble();
            int size2=largeSmallMedium().size();
            preamble();
            int size3=smallMediumLarge().size();
            preamble();
            int size4=smallLargeMedium().size();
            preamble();
            int size5=mediumSmallLarge().size();
            preamble();
            int size6=mediumLargeSmall().size();

            //sanity check + ensuring the JIT can't optimise out
            if (size1!=size2 || size1!=size3 || size1!=size4 || size1!=size5 || size1!=size6){
                System.out.println("bad");
            }
        }


    }

    public static Set<Integer> largeMediumSmall(){
        large.retainAll(medium);
        large.retainAll(small);

        return large;
    }

    public static Set<Integer> smallMediumLarge(){
        small.retainAll(medium);
        small.retainAll(large);

        return small;
    }
    public static Set<Integer> smallLargeMedium(){
        small.retainAll(large);
        small.retainAll(medium);

        return small;
    }
    public static Set<Integer> mediumSmallLarge(){
        medium.retainAll(small);
        medium.retainAll(large);

        return medium;
    }
    public static Set<Integer> mediumLargeSmall(){
        medium.retainAll(large);
        medium.retainAll(small);

        return medium;
    }
    public static Set<Integer> largeSmallMedium(){
        large.retainAll(small);
        large.retainAll(medium);

        return large;
    }


    public static void preamble(){
        large =new HashSet<>();
        medium =new HashSet<>();
        small =new HashSet<>();

        Random rnd=new Random(15);
        for(int i=0;i<largeSize;i++){
            large.add(rnd.nextInt(largeSize*10));
        }

        for(int i=0;i<midSize;i++){
            medium.add(rnd.nextInt(largeSize*10));
        }
        for(int i=0;i<smallSize;i++){
            small.add(rnd.nextInt(largeSize*10));
        }

    }

}
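
The program above prints no timings itself, so it presumably relies on an external profiler. For a rough check without one, a sketch along the following lines may help. The class `RetainTiming` and its methods are hypothetical names, and `System.nanoTime` loops like this are sensitive to JIT warm-up and garbage collection, so treat the numbers as indicative only; a harness such as JMH would be more trustworthy.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

public class RetainTiming {

    // Builds a set from `count` random draws below `bound`, like preamble().
    public static Set<Integer> randomSet(Random rnd, int count, int bound) {
        Set<Integer> set = new HashSet<>();
        for (int i = 0; i < count; i++) {
            set.add(rnd.nextInt(bound));
        }
        return set;
    }

    // Times first.retainAll(second) followed by first.retainAll(third).
    // Mutates `first` and returns the elapsed time in nanoseconds.
    public static long timeOrder(Set<Integer> first, Set<Integer> second, Set<Integer> third) {
        long start = System.nanoTime();
        first.retainAll(second);
        first.retainAll(third);
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        Random rnd = new Random(15);
        Set<Integer> large = randomSet(rnd, 10000, 100000);
        Set<Integer> medium = randomSet(rnd, 5000, 100000);
        Set<Integer> small = randomSet(rnd, 1000, 100000);

        long largeFirst = 0;
        long smallFirst = 0;
        int runs = 2000;
        for (int run = 0; run < runs; run++) {
            // retainAll mutates its receiver, so each run works on a fresh copy.
            // The copy is made before the timer starts and is not measured.
            largeFirst += timeOrder(new HashSet<>(large), medium, small);
            smallFirst += timeOrder(new HashSet<>(small), medium, large);
        }
        System.out.println("large, medium, small: " + largeFirst / runs + " ns per run");
        System.out.println("small, medium, large: " + smallFirst / runs + " ns per run");
    }
}
```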

Answer 2

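The cost of a lookup in a hash set does not depend on the size of the set. setA.retainAll(setB) iterates over setA and queries setB for each element (see the implementation of AbstractCollection.retainAll()). The overall cost of this operation therefore depends linearly on the size of setA. So you should always iterate over the smallest set: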

small.retainAll(medium);
small.retainAll(large);
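
To make the cost argument concrete, here is a sketch that mirrors the loop described above, with a counter added. It is not the JDK source; the class `RetainCost` and the methods `retainAllCounting` and `range` are hypothetical names introduced here. The number of `contains` calls equals the starting size of the set being iterated, whatever the size of the other set.

```java
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

public class RetainCost {

    // Hypothetical stand-in for setA.retainAll(setB): walks setA, looks
    // each element up in setB, and removes the misses.
    // Returns how many contains() lookups were performed.
    public static <T> int retainAllCounting(Set<T> setA, Set<?> setB) {
        int lookups = 0;
        Iterator<T> it = setA.iterator();
        while (it.hasNext()) {
            lookups++;
            if (!setB.contains(it.next())) {
                it.remove();
            }
        }
        return lookups;
    }

    // Builds the set {0, 1, ..., n-1}.
    public static Set<Integer> range(int n) {
        Set<Integer> set = new HashSet<>();
        for (int i = 0; i < n; i++) {
            set.add(i);
        }
        return set;
    }

    public static void main(String[] args) {
        // Same sizes as the question's small and large sets.
        System.out.println(retainAllCounting(range(1000), range(10000))); // prints 1000
        System.out.println(retainAllCounting(range(10000), range(1000))); // prints 10000
    }
}
```

Both calls leave the same 1000 elements behind, but the second performs ten times as many lookups to get there.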

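Richard Tingle's benchmark demonstrates this. Edit: ah, Richard Tingle is also the author of the question :)

If you only have three sets and performance really matters, try to find the intersection in a single iteration: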

Iterator<E> it = small.iterator();
while (it.hasNext()) {
    E e = it.next();
    if (!medium.contains(e) || !large.contains(e))
        it.remove();
}

Since Java 8:

small.removeIf(e -> !medium.contains(e) || !large.contains(e));
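
The same single-pass idea extends past three sets and can be made non-destructive. The sketch below is an illustration rather than part of the original answer; `SinglePassIntersection` and `intersect` are hypothetical names. It copies the smallest set and then calls `removeIf` once, checking each element against every other set, so the inputs are left unmodified at the cost of one copy.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SinglePassIntersection {

    // Hypothetical helper: returns a new set holding the elements
    // common to all given sets, without modifying any of them.
    public static <T> Set<T> intersect(List<Set<T>> sets) {
        if (sets.isEmpty()) {
            throw new IllegalArgumentException("need at least one set");
        }
        // Find the smallest set; it bounds the number of iterations.
        Set<T> smallest = sets.get(0);
        for (Set<T> s : sets) {
            if (s.size() < smallest.size()) {
                smallest = s;
            }
        }
        final Set<T> base = smallest;

        Set<T> result = new HashSet<>(base);
        // One pass: drop an element as soon as any other set lacks it.
        result.removeIf(e -> {
            for (Set<T> s : sets) {
                if (s != base && !s.contains(e)) {
                    return true;
                }
            }
            return false;
        });
        return result;
    }

    public static void main(String[] args) {
        Set<Integer> large = new HashSet<>(Arrays.asList(1, 2, 3, 4, 5, 6));
        Set<Integer> medium = new HashSet<>(Arrays.asList(2, 3, 4, 7));
        Set<Integer> small = new HashSet<>(Arrays.asList(3, 4, 8));

        Set<Integer> common = intersect(Arrays.asList(large, medium, small));
        System.out.println(common.size()); // prints 2
        System.out.println(small.size());  // prints 3, the input is untouched
    }
}
```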

What is the fastest order in which to use retainAll() when finding the intersection of several sets?