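I am trying to find the intersection of three hash sets of different sizes. Does the order in which the sets are intersected affect how quickly the intersection is found? An example program follows: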
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

public class RetainTest {
static Set<Integer> large =new HashSet<>();
static Set<Integer> medium =new HashSet<>();
static Set<Integer> small =new HashSet<>();
static int largeSize=10000;
static int midSize=5000;
static int smallSize=1000;
public static void main(String[] args){
preamble();
// Intersect in place: afterwards, large holds only the elements common to all three sets
large.retainAll(medium);
large.retainAll(small);
System.out.println(large.size());
}
public static void preamble(){
large =new HashSet<>();
medium =new HashSet<>();
small =new HashSet<>();
Random rnd=new Random(15);
for(int i=0;i<largeSize;i++){
large.add(rnd.nextInt(largeSize*10));
}
for(int i=0;i<midSize;i++){
medium.add(rnd.nextInt(largeSize*10));
}
for(int i=0;i<smallSize;i++){
small.add(rnd.nextInt(largeSize*10));
}
}
}
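Answer 0 (score: 3)
Profiling shows that the fastest way to combine several sets is to call retainAll on the smaller set, passing in the larger sets. The order of those retainAll calls should also go from smallest to largest. So: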
small.retainAll(medium);
small.retainAll(large);
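Profiling shows the difference is significant: for this data set, the slowest order took roughly 10 times as long as the fastest order.
These results were produced with the following test program, which was left running for 20 minutes: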
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

public class RetainTest {
static Set<Integer> large =new HashSet<>();
static Set<Integer> medium =new HashSet<>();
static Set<Integer> small =new HashSet<>();
static int largeSize=10000;
static int midSize=5000;
static int smallSize=1000;
public static void main(String[] args){
while(true){
preamble();
int size1=largeMediumSmall().size();
preamble();
int size2=largeSmallMedium().size();
preamble();
int size3=smallMediumLarge().size();
preamble();
int size4=smallLargeMedium().size();
preamble();
int size5=mediumSmallLarge().size();
preamble();
int size6=mediumLargeSmall().size();
//sanity check + ensuring the JIT can't optimise out
if (size1!=size2 || size1!=size3 || size1!=size4 || size1!=size5 || size1!=size6){
System.out.println("bad");
}
}
}
public static Set<Integer> largeMediumSmall(){
large.retainAll(medium);
large.retainAll(small);
return large;
}
public static Set<Integer> smallMediumLarge(){
small.retainAll(medium);
small.retainAll(large);
return small;
}
public static Set<Integer> smallLargeMedium(){
small.retainAll(large);
small.retainAll(medium);
return small;
}
public static Set<Integer> mediumSmallLarge(){
medium.retainAll(small);
medium.retainAll(large);
return medium;
}
public static Set<Integer> mediumLargeSmall(){
medium.retainAll(large);
medium.retainAll(small);
return medium;
}
public static Set<Integer> largeSmallMedium(){
large.retainAll(small);
large.retainAll(medium);
return large;
}
public static void preamble(){
large =new HashSet<>();
medium =new HashSet<>();
small =new HashSet<>();
Random rnd=new Random(15);
for(int i=0;i<largeSize;i++){
large.add(rnd.nextInt(largeSize*10));
}
for(int i=0;i<midSize;i++){
medium.add(rnd.nextInt(largeSize*10));
}
for(int i=0;i<smallSize;i++){
small.add(rnd.nextInt(largeSize*10));
}
}
}
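The smallest-first rule from this answer can be generalised to any number of sets. The following is only a sketch: the class name `IntersectSketch` and the method `intersect` are hypothetical and not part of the original answer, and unlike the benchmark it copies the smallest set rather than modifying it in place.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of the original answer.
class IntersectSketch {

    // Returns a new set holding the elements common to all given sets.
    // The input sets are left unmodified.
    @SafeVarargs
    static <T> Set<T> intersect(Set<T>... sets) {
        if (sets.length == 0) {
            return new HashSet<>();
        }
        // Sort a copy of the argument list by ascending size.
        List<Set<T>> bySize = new ArrayList<>(Arrays.asList(sets));
        bySize.sort(Comparator.comparingInt(Set::size));

        // Start from a copy of the smallest set, so that every retainAll
        // call iterates over as few elements as possible.
        Set<T> result = new HashSet<>(bySize.get(0));
        for (int i = 1; i < bySize.size(); i++) {
            result.retainAll(bySize.get(i));
        }
        return result;
    }

    public static void main(String[] args) {
        Set<Integer> small = new HashSet<>(Arrays.asList(1, 2, 3));
        Set<Integer> medium = new HashSet<>(Arrays.asList(2, 3, 4, 5));
        Set<Integer> large = new HashSet<>(Arrays.asList(0, 2, 3, 5, 7, 9));
        // The argument order does not matter; the helper sorts by size itself.
        System.out.println(intersect(large, medium, small));
    }
}
```

Copying the smallest set costs one extra pass over it, which is usually cheap compared with iterating over a larger set, and it keeps the caller's sets intact.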
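Answer 1 (score: 2)
The cost of a lookup in a hash set does not depend on the size of the set (lookups take constant time on average). setA.retainAll(setB) iterates over setA and queries setB for each element (see the implementation of AbstractCollection.retainAll()). The overall cost of this operation therefore depends linearly on the size of setA. For that reason, you should always iterate over the smallest set: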
small.retainAll(medium);
small.retainAll(large);
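Richard Tingle's benchmark confirms this. Edit: ah, Richard Tingle is also the author of the question :)
If you have only three sets and performance really matters, try to find the intersection in a single iteration: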
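The claim that the cost follows the size of the receiving set can be checked without a timer by counting lookups. This is a minimal sketch, assuming the HashSet in use inherits retainAll from AbstractCollection (as it does in OpenJDK); `LookupCountSketch`, `CountingSet` and `lookupsFor` are hypothetical names introduced only for this illustration.

```java
import java.util.HashSet;
import java.util.Set;

class LookupCountSketch {

    // Hypothetical helper: a HashSet that counts how often contains() is called.
    static class CountingSet<T> extends HashSet<T> {
        int lookups = 0;

        @Override
        public boolean contains(Object o) {
            lookups++;
            return super.contains(o);
        }
    }

    // Runs receiver.retainAll(argument) on freshly built sets and returns
    // the number of contains() calls made against the argument.
    static int lookupsFor(int receiverSize, int argumentSize) {
        Set<Integer> receiver = new HashSet<>();
        for (int i = 0; i < receiverSize; i++) {
            receiver.add(i);
        }
        CountingSet<Integer> argument = new CountingSet<>();
        for (int i = 0; i < argumentSize; i++) {
            argument.add(i);
        }
        receiver.retainAll(argument);
        return argument.lookups;
    }

    public static void main(String[] args) {
        // small.retainAll(large): one lookup per element of the small set
        System.out.println("small.retainAll(large): " + lookupsFor(1000, 10000));
        // large.retainAll(small): one lookup per element of the large set
        System.out.println("large.retainAll(small): " + lookupsFor(10000, 1000));
    }
}
```

Under that assumption the number of lookups equals the size of the set that retainAll is called on, regardless of how large the argument is.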
Iterator<E> it = small.iterator();
while (it.hasNext()) {
E e = it.next();
if (!medium.contains(e) || !large.contains(e))
it.remove();
}
Since Java 8:
small.removeIf(e -> !medium.contains(e) || !large.contains(e));
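For reference, the single-pass variant can be wrapped in a self-contained program along these lines. The class name `SinglePassSketch`, the method `intersectInPlace` and the sample values are assumptions made for the example; note that the method modifies the smallest set in place, as the snippets above do.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical wrapper around the removeIf one-liner from the answer.
class SinglePassSketch {

    // Removes from small every element that is missing from medium or large.
    // Modifies small in place and returns it for convenience.
    static <E> Set<E> intersectInPlace(Set<E> small, Set<E> medium, Set<E> large) {
        small.removeIf(e -> !medium.contains(e) || !large.contains(e));
        return small;
    }

    public static void main(String[] args) {
        Set<Integer> small = new HashSet<>(Arrays.asList(1, 2, 3));
        Set<Integer> medium = new HashSet<>(Arrays.asList(2, 3, 4, 5));
        Set<Integer> large = new HashSet<>(Arrays.asList(0, 2, 3, 5, 7, 9));
        System.out.println(intersectInPlace(small, medium, large));
    }
}
```

Because `||` short-circuits, an element that is already missing from medium is never looked up in large.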