Question

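I am trying to solve the Nth Ugly Number problem. I tried using a HashSet to avoid adding duplicate elements to the PriorityQueue. I expected add() and contains() on a HashSet to be O(1), which is better than the O(log(n)) of PriorityQueue add(). However, my implementation is always slower than the PriorityQueue-only solution.

I then counted the conflicts to see the duplicate ratio. It is consistently over 10%. So, as N grows, using the HashSet should be better (10% * log(n) >> 90% * C for large n). The odd thing is that as N grows, using the HashSet gets even worse: performance is almost the same at n = 1000, 10000 and 100000, but it is 3 times worse at 1,000,000 and 4 times worse at 10,000,000. I have read (Fastest Java HashSet<Integer> library), which suggests 1.5n as the initial capacity. The HashSet usually holds 2.5~3n elements, so I set the capacity to 4n or 5n. It did not help.

Does anyone know why this happens?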

import java.util.HashSet;
import java.util.PriorityQueue;

public class Test {
  int conflict = 0;

  public static void main(String[] args) {
    Test test = new Test();
    long start = System.currentTimeMillis();
    int N = 10000000;
    test.nthUglyNumber(N);
    long end = System.currentTimeMillis();
    System.out.println("Time:" + (end - start));


    start = System.currentTimeMillis();
    test.nthUglyNumber2(N);
    end = System.currentTimeMillis();
    System.out.println("Time:" + (end - start));
  }

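  // Variant 1: a HashSet (CLOSED) filters duplicates before they reach the PriorityQueue (OPEN).
  // Note: the int products below overflow for large n, so only the timing is meaningful there.
  public int nthUglyNumber(int n) {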
    if (n <= 0) {
      return 0;
    }
    HashSet<Integer> CLOSED = new HashSet<Integer>(5 * n);
    PriorityQueue<Integer> OPEN = new PriorityQueue<Integer>();
    int cur = 1;
    OPEN.add(cur);
    CLOSED.add(cur);
    while (n > 1) {
      n--;
      cur = OPEN.poll();
      int cur2 = cur * 2;
      if (CLOSED.add(cur2)) {
        OPEN.add(cur2);
      }
      // else {
      // conflict++;
      // }
      int cur3 = cur * 3;
      if (CLOSED.add(cur3)) {
        OPEN.add(cur3);
      }
      // else{
      // conflict++;
      // }

      int cur5 = cur * 5;
      if (CLOSED.add(cur5)) {
        OPEN.add(cur5);
      }
      // else{
      // conflict++;
      // }
    }
    return OPEN.peek();
  }

  // Variant 2: PriorityQueue only; duplicates are inserted and skipped when polled.
  public int nthUglyNumber2(int n) {
    if (n == 1)
      return 1;
    PriorityQueue<Long> q = new PriorityQueue<>();
    q.add(1L);

    for (long i = 1; i < n; i++) {
      long tmp = q.poll();
      while (!q.isEmpty() && q.peek() == tmp)
        tmp = q.poll();

      q.add(tmp * 2);
      q.add(tmp * 3);
      q.add(tmp * 5);
    }
    return q.poll().intValue();
  }
}

Answer 1

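I think your analysis does not take memory management overhead into account. Each time the GC runs, it needs to trace and move some or all of the reachable objects in the HashSet. While this is hard to quantify in the general case, in the worst case (a full GC) the extra work is O(N).

There could also be secondary memory effects; for example, the version with the HashSet has a larger working set, which leads to more memory cache misses. This will be most pronounced during garbage collection.

I suggest that you profile both versions of the code to determine where the extra time is actually being spent.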
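As a rough first step before reaching for a full profiler, the JVM's own GC counters can show how much of the wall-clock time went to garbage collection. The following is only a sketch: the class name GcTimeProbe and the fillSet workload are made up for illustration and stand in for the CLOSED set from the question; the actual numbers depend on heap size and collector.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.HashSet;

class GcTimeProbe {
  // Sums the accumulated collection time (ms) reported by all collectors of this JVM.
  static long totalGcMillis() {
    long total = 0;
    for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
      long t = gc.getCollectionTime(); // -1 if the collector does not report a time
      if (t > 0) {
        total += t;
      }
    }
    return total;
  }

  // Hypothetical workload standing in for the CLOSED set: n boxed Integers in a HashSet.
  static int fillSet(int n) {
    HashSet<Integer> set = new HashSet<>();
    for (int i = 0; i < n; i++) {
      set.add(i);
    }
    return set.size();
  }

  public static void main(String[] args) {
    long gcBefore = totalGcMillis();
    long start = System.nanoTime();
    int size = fillSet(2_000_000);
    long wallMs = (System.nanoTime() - start) / 1_000_000;
    long gcMs = totalGcMillis() - gcBefore;
    System.out.println("size=" + size + " wall=" + wallMs + "ms gc=" + gcMs + "ms");
  }
}
```

Wrapping each of the two nthUglyNumber variants the same way would show whether the gap between them is mostly GC time or mostly time spent in the loop itself.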

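If you are looking for ways to make the caching behave better:

Look for a specialized representation for the set; e.g. a Bitset or a third-party library.
Consider using a LinkedHashSet and removing entries once they have passed out of the window in which they could still produce a cache hit.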
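The window idea can be sketched as follows. This is not the answerer's code: it swaps the LinkedHashSet for a plain HashSet and evicts at poll time, relying on the assumption (which holds for this problem) that every newly generated product is larger than the value just polled, so a polled value can never be generated again. The class name WindowedUgly and the variable pending are made up for illustration, and long only postpones overflow.

```java
import java.util.HashSet;
import java.util.PriorityQueue;

class WindowedUgly {
  private static final long[] FACTORS = {2, 3, 5};

  // Returns the n-th ugly number (1-based), or 0 for n <= 0.
  static long nthUgly(int n) {
    if (n <= 0) {
      return 0;
    }
    PriorityQueue<Long> open = new PriorityQueue<>();
    HashSet<Long> pending = new HashSet<>(); // mirrors exactly the values currently in 'open'
    open.add(1L);
    pending.add(1L);
    long cur = 1;
    for (int i = 0; i < n; i++) {
      cur = open.poll();
      pending.remove(cur); // all later products exceed cur, so it cannot reappear
      for (long f : FACTORS) {
        long next = cur * f;
        if (pending.add(next)) { // add() returns false for a duplicate
          open.add(next);
        }
      }
    }
    return cur;
  }

  public static void main(String[] args) {
    System.out.println(nthUgly(10));   // 12
    System.out.println(nthUgly(1500)); // 859963392
  }
}
```

With this change the set never holds more entries than the queue, instead of every value ever generated. Whether that is enough to beat the PriorityQueue-only version would still have to be measured.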

Answer 2

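Note that when there is no conflict (90% of the cases), you call add twice: once on the HashSet and once on the PriorityQueue; whereas the PriorityQueue-only solution calls add only once.

So the HashSet adds overhead in 90% of the cases while speeding up only 10% of them.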
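To see this trade-off in numbers, the HashSet-guarded loop can be instrumented to count how often each add is called. This is a sketch only: the class name AddCounter and its counters are made up for illustration, and the two initial adds of the seed value 1 are not counted.

```java
import java.util.HashSet;
import java.util.PriorityQueue;

class AddCounter {
  private static final long[] FACTORS = {2, 3, 5};

  // Runs the HashSet-guarded loop n - 1 times and returns {setAdds, queueAdds, duplicates}.
  static long[] count(int n) {
    HashSet<Long> closed = new HashSet<>();
    PriorityQueue<Long> open = new PriorityQueue<>();
    open.add(1L);
    closed.add(1L);
    long setAdds = 0;
    long queueAdds = 0;
    long duplicates = 0;
    for (int i = 1; i < n; i++) {
      long cur = open.poll();
      for (long f : FACTORS) {
        long next = cur * f;
        setAdds++; // the HashSet is consulted for every product
        if (closed.add(next)) {
          open.add(next); // second add, paid whenever there is no conflict
          queueAdds++;
        } else {
          duplicates++; // the only case where a PriorityQueue add is saved
        }
      }
    }
    return new long[] {setAdds, queueAdds, duplicates};
  }

  public static void main(String[] args) {
    long[] c = count(1000);
    System.out.println("set adds=" + c[0] + " queue adds=" + c[1] + " duplicates=" + c[2]);
  }
}
```

The PriorityQueue-only version would perform exactly as many queue adds as the setAdds figure here, so the HashSet pays one hash insertion per product in exchange for skipping only the duplicates.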

Why does HashSet perform badly for large N?
