Question

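I'm trying to figure out the optimal capacity and load factor for a specific case. I think I've got the gist of it, but I'd still be thankful for confirmation from someone more knowledgeable than me. :)

If I know that my HashMap will fill up to contain, say, 100 objects, and will spend most of its time holding 100 objects, I'm guessing that the optimal values are an initial capacity of 100 and a load factor of 1? Or do I need capacity 101, or are there any other gotchas?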

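EDIT: OK, I set aside a few hours and did some testing. Here are the results:

Curiously, capacity, capacity+1, capacity+2, capacity-1 and even capacity-10 all yield exactly the same results. I would expect at least capacity-1 and capacity-10 to give worse results.
Using an initial capacity (as opposed to the default value of 16) gives a noticeable put() improvement, up to 30% faster.
Using a load factor of 1 gives equal performance for a small number of objects, and better performance for a larger number of objects (>100000). However, the improvement is not proportional to the number of objects; I suspect there is an additional factor that affects the results.
get() performance differs a bit for different numbers of objects/capacities, but although it might vary slightly from case to case, it is generally not affected by the initial capacity or the load factor.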

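EDIT2: Adding some charts on my part as well. Here's the one illustrating the difference between load factors 0.75 and 1, in cases where I initialize the HashMap and fill it up to full capacity. The y axis is time in ms (lower is better), and the x axis is size (number of objects). Since the size changes linearly, the time required grows linearly as well.

So, let's see what I got. The following two charts show the difference between load factors. The first chart shows what happens when the HashMap is filled to capacity; load factor 0.75 performs worse because of resizing. However, it's not consistently worse, and there are all sorts of bumps and hops. I guess that GC plays a major part in this. Load factor 1.25 performs the same as 1, so it's not included in the chart.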

fully filled

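This chart shows that 0.75 was worse due to resizing; if we fill the HashMap to half capacity, 0.75 is not worse, just... different (and it should use less memory and have imperceptibly better iteration performance).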

half full

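One more thing I want to show. This is get performance for all three load factors and different HashMap sizes. It is consistently constant with a little variation, except for one spike for load factor 1. I'd really like to know what that is (probably GC, but who knows).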

go spike

And here's the code for those interested:

import java.util.HashMap;
import java.util.Map;

public class HashMapTest {

  // capacity - values as high as 10000000 require the -Xmx1536m -Xms1536m JVM parameters
  public static final int CAPACITY = 10000000;
  public static final int ITERATIONS = 10000;

  // set to false to print put performance, or to true to print get performance
  boolean doIterations = false;

  private Map<Integer, String> cache;

  public void fillCache(int capacity) {
    long t = System.currentTimeMillis();
    // insert exactly `capacity` entries (keys 0 .. capacity - 1)
    for (int i = 0; i < capacity; i++)
      cache.put(i, "Value number " + i);

    if (!doIterations) {
      System.out.print(System.currentTimeMillis() - t);
      System.out.print("\t");
    }
  }

  public void iterate(int capacity) {
    long t = System.currentTimeMillis();

    for (int i = 0; i < ITERATIONS; i++) {
      // pick a random key in [0, capacity); the looked-up value is discarded
      int key = (int) (Math.random() * capacity);
      cache.get(key);
    }

    if (doIterations) {
      System.out.print(System.currentTimeMillis() - t);
      System.out.print("\t");
    }
  }

  public void test(float loadFactor, int divider) {
    for (int i = 10000; i <= CAPACITY; i+= 10000) {
      cache = new HashMap<Integer, String>(i, loadFactor);
      fillCache(i / divider);
      if (doIterations)
        iterate(i / divider);
    }
    System.out.println();
  }

  public static void main(String[] args) {
    HashMapTest test = new HashMapTest();

    // fill to capacity
    test.test(0.75f, 1);
    test.test(1, 1);
    test.test(1.25f, 1);

    // fill to half capacity
    test.test(0.75f, 2);
    test.test(1, 2);
    test.test(1.25f, 2);
  }

}

Answer 1

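This is a pretty great thread, except there is one crucial thing you're missing. You said:

Curiously, capacity, capacity+1, capacity+2, capacity-1 and even capacity-10 all yield exactly the same results. I would expect at least capacity-1 and capacity-10 to give worse results.

Internally, the source code rounds the initial capacity up to the next power of two. That means that, for example, initial capacities of 513, 600, 700, 800, 900, 1000, and 1024 will all use the same initial capacity (1024). This doesn't invalidate the testing done by @G_H, but one should be aware of this rounding before analyzing his results. It does explain the odd behavior of some of the tests.

This is the relevant constructor from the JDK source: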

/**
 * Constructs an empty <tt>HashMap</tt> with the specified initial
 * capacity and load factor.
 *
 * @param  initialCapacity the initial capacity
 * @param  loadFactor      the load factor
 * @throws IllegalArgumentException if the initial capacity is negative
 *         or the load factor is nonpositive
 */
public HashMap(int initialCapacity, float loadFactor) {
    if (initialCapacity < 0)
        throw new IllegalArgumentException("Illegal initial capacity: " +
                                           initialCapacity);
    if (initialCapacity > MAXIMUM_CAPACITY)
        initialCapacity = MAXIMUM_CAPACITY;
    if (loadFactor <= 0 || Float.isNaN(loadFactor))
        throw new IllegalArgumentException("Illegal load factor: " +
                                           loadFactor);

    // Find a power of 2 >= initialCapacity
    int capacity = 1;
    while (capacity < initialCapacity)
        capacity <<= 1;

    this.loadFactor = loadFactor;
    threshold = (int)(capacity * loadFactor);
    table = new Entry[capacity];
    init();
}
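
To see the rounding in isolation, here is a minimal sketch that reuses the loop from the constructor above. The class name TableSizeSketch and the helper tableSizeFor are made up for this illustration and are not part of the JDK; the constant 1 << 30 is assumed to match MAXIMUM_CAPACITY.

```java
public class TableSizeSketch {

    // Assumed to mirror HashMap.MAXIMUM_CAPACITY.
    static final int MAX = 1 << 30;

    // Same loop as in the constructor: smallest power of two >= initialCapacity.
    static int tableSizeFor(int initialCapacity) {
        if (initialCapacity > MAX)
            initialCapacity = MAX;
        int capacity = 1;
        while (capacity < initialCapacity)
            capacity <<= 1;
        return capacity;
    }

    public static void main(String[] args) {
        // capacity-10, capacity-1, capacity, capacity+1, capacity+2 for capacity = 100,
        // followed by the values mentioned in the answer
        int[] requested = {90, 99, 100, 101, 102, 513, 1000, 1024, 1025};
        for (int c : requested)
            System.out.println(c + " -> " + tableSizeFor(c));
    }
}
```

Under these assumptions, 90, 99, 100, 101 and 102 all map to a table of 128 buckets, and 513 through 1024 all map to 1024, which would be consistent with the identical timings reported in the question.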

Answer 2

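Just go with 101. I'm not actually sure that it's needed, but it couldn't possibly be worth the effort to find out for sure.

...just add the 1.

EDIT: Some justification for my answer.

First, I'm assuming that your HashMap will not grow beyond 100; if it does, you should leave the load factor as it is. Similarly, if your concern is performance, leave the load factor as is. If your concern is memory, you can save some by setting the static size. This might be worth doing if you're cramming a lot of stuff into memory, i.e., storing many maps, or creating maps large enough to stress the heap.

Second, I chose the value 101 because it offers better readability... if I'm looking at your code afterwards and see that you've set the initial capacity to 100 and you're loading it with 100 elements, I'm going to have to read through the Javadoc to make sure that it won't resize when it reaches precisely 100. Of course, I won't find the answer there, so I'll have to look at the source. This is not worth it... just leave it at 101 and everyone is happy and no one is looking through the source code of java.util.HashMap. Hoorah.

Third, the claim that setting the HashMap to the exact capacity you expect with a load factor of 1 "will kill your lookup and insertion performance" is just not true, even if it's made in bold.

...if you have n buckets, and you randomly assign n items to n buckets, yep, you're going to end up with items in the same bucket, sure... but that's not the end of the world... in practice, it's just a couple more equals comparisons. In fact, there's especially little difference when you consider that the alternative is assigning n items to n/0.75 buckets.

No need to take my word for it...

Quick test code: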

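import java.util.ArrayList;
import java.util.HashMap;
import java.util.Random;

// the members below belong inside a test class (declaration omitted)
static Random r = new Random();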

public static void main(String[] args){
    int[] tests = {100, 1000, 10000};
    int runs = 5000;

    float lf_sta = 1f;
    float lf_dyn = 0.75f;

    for(int t:tests){
        System.err.println("=======Test Put "+t+"");
        HashMap<Integer,Integer> map = new HashMap<Integer,Integer>();
        long norm_put = testInserts(map, t, runs);
        System.err.print("Norm put:"+norm_put+" ms. ");

        int cap_sta = t;
        map = new HashMap<Integer,Integer>(cap_sta, lf_sta);
        long sta_put = testInserts(map, t, runs);
        System.err.print("Static put:"+sta_put+" ms. ");

        int cap_dyn = (int)Math.ceil((float)t/lf_dyn);
        map = new HashMap<Integer,Integer>(cap_dyn, lf_dyn);
        long dyn_put = testInserts(map, t, runs);
        System.err.println("Dynamic put:"+dyn_put+" ms. ");
    }

    for(int t:tests){
        System.err.println("=======Test Get (hits) "+t+"");
        HashMap<Integer,Integer> map = new HashMap<Integer,Integer>();
        fill(map, t);
        long norm_get_hits = testGetHits(map, t, runs);
        System.err.print("Norm get (hits):"+norm_get_hits+" ms. ");

        int cap_sta = t;
        map = new HashMap<Integer,Integer>(cap_sta, lf_sta);
        fill(map, t);
        long sta_get_hits = testGetHits(map, t, runs);
        System.err.print("Static get (hits):"+sta_get_hits+" ms. ");

        int cap_dyn = (int)Math.ceil((float)t/lf_dyn);
        map = new HashMap<Integer,Integer>(cap_dyn, lf_dyn);
        fill(map, t);
        long dyn_get_hits = testGetHits(map, t, runs);
        System.err.println("Dynamic get (hits):"+dyn_get_hits+" ms. ");
    }

    for(int t:tests){
        System.err.println("=======Test Get (Rand) "+t+"");
        HashMap<Integer,Integer> map = new HashMap<Integer,Integer>();
        fill(map, t);
        long norm_get_rand = testGetRand(map, t, runs);
        System.err.print("Norm get (rand):"+norm_get_rand+" ms. ");

        int cap_sta = t;
        map = new HashMap<Integer,Integer>(cap_sta, lf_sta);
        fill(map, t);
        long sta_get_rand = testGetRand(map, t, runs);
        System.err.print("Static get (rand):"+sta_get_rand+" ms. ");

        int cap_dyn = (int)Math.ceil((float)t/lf_dyn);
        map = new HashMap<Integer,Integer>(cap_dyn, lf_dyn);
        fill(map, t);
        long dyn_get_rand = testGetRand(map, t, runs);
        System.err.println("Dynamic get (rand):"+dyn_get_rand+" ms. ");
    }
}

public static long testInserts(HashMap<Integer,Integer> map, int test, int runs){
    long b4 = System.currentTimeMillis();

    for(int i=0; i<runs; i++){
        fill(map, test);
        map.clear();
    }
    return System.currentTimeMillis()-b4;
}

public static void fill(HashMap<Integer,Integer> map, int test){
    for(int j=0; j<test; j++){
        if(map.put(r.nextInt(), j)!=null){
            j--;
        }
    }
}

public static long testGetHits(HashMap<Integer,Integer> map, int test, int runs){
    // collect the keys before starting the clock so only the lookups are timed
    ArrayList<Integer> keys = new ArrayList<Integer>(map.keySet());

    long b4 = System.currentTimeMillis();

    for(int i=0; i<runs; i++){
        for(int j=0; j<test; j++){
            // look up a key that is known to be present in the map
            map.get(keys.get(r.nextInt(keys.size())));
        }
    }
    return System.currentTimeMillis()-b4;
}

public static long testGetRand(HashMap<Integer,Integer> map, int test, int runs){
    long b4 = System.currentTimeMillis();

    for(int i=0; i<runs; i++){
        for(int j=0; j<test; j++){
            map.get(r.nextInt());
        }
    }
    return System.currentTimeMillis()-b4;
}

Test results:

=======Test Put 100
Norm put:78 ms. Static put:78 ms. Dynamic put:62 ms. 
=======Test Put 1000
Norm put:764 ms. Static put:763 ms. Dynamic put:748 ms. 
=======Test Put 10000
Norm put:12921 ms. Static put:12889 ms. Dynamic put:12873 ms. 
=======Test Get (hits) 100
Norm get (hits):47 ms. Static get (hits):31 ms. Dynamic get (hits):32 ms. 
=======Test Get (hits) 1000
Norm get (hits):327 ms. Static get (hits):328 ms. Dynamic get (hits):343 ms. 
=======Test Get (hits) 10000
Norm get (hits):3304 ms. Static get (hits):3366 ms. Dynamic get (hits):3413 ms. 
=======Test Get (Rand) 100
Norm get (rand):63 ms. Static get (rand):46 ms. Dynamic get (rand):47 ms. 
=======Test Get (Rand) 1000
Norm get (rand):483 ms. Static get (rand):499 ms. Dynamic get (rand):483 ms. 
=======Test Get (Rand) 10000
Norm get (rand):5190 ms. Static get (rand):5362 ms. Dynamic get (rand):5236 ms.

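re: ↑ - there's about this →||← much difference between the different settings.

With respect to my original answer (the part before the edit), it was deliberately glib because in most cases, this type of micro-optimising is not good.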

Answer 3

In terms of implementation, Google Guava has a convenient factory method

Maps.newHashMapWithExpectedSize(expectedSize)

which calculates the capacity using the formula

capacity = expectedSize / 0.75F + 1.0F
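
For readers who do not have Guava on the classpath, the formula can be applied by hand. The following is only a sketch of the quoted formula, not Guava's actual implementation; the class ExpectedSizeSketch and the helpers capacityFor and newMapWithExpectedSize are hypothetical names chosen for this example.

```java
import java.util.HashMap;
import java.util.Map;

public class ExpectedSizeSketch {

    // The formula quoted above; the float result is truncated to an int.
    static int capacityFor(int expectedSize) {
        return (int) (expectedSize / 0.75F + 1.0F);
    }

    // Hypothetical stand-in for the factory method: a plain HashMap sized by the formula.
    static <K, V> Map<K, V> newMapWithExpectedSize(int expectedSize) {
        return new HashMap<K, V>(capacityFor(expectedSize), 0.75f);
    }

    public static void main(String[] args) {
        Map<Integer, String> map = newMapWithExpectedSize(100);
        for (int i = 0; i < 100; i++)
            map.put(i, "Value number " + i);
        // prints "134 100"
        System.out.println(capacityFor(100) + " " + map.size());
    }
}
```

For 100 expected entries the requested capacity is 134, so the resize threshold (capacity times load factor) stays above 100 and the map should not need to rehash while it is being filled.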

Answer 4

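From the HashMap JavaDoc:

As a general rule, the default load factor (.75) offers a good tradeoff between time and space costs. Higher values decrease the space overhead but increase the lookup cost (reflected in most of the operations of the HashMap class, including get and put). The expected number of entries in the map and its load factor should be taken into account when setting its initial capacity, so as to minimize the number of rehash operations. If the initial capacity is greater than the maximum number of entries divided by the load factor, no rehash operations will ever occur.

So if you're expecting 100 entries, perhaps a load factor of 0.75 and an initial capacity of ceiling(100/0.75) would be best. That comes down to 134.

I have to admit, I'm not certain why lookup cost would be greater for a higher load factor. Just because the HashMap is more "crowded" doesn't mean that more objects will be placed in the same bucket, right? That only depends on their hash code, if I'm not mistaken. So assuming a decent hash code spread, shouldn't most cases still be O(1) regardless of load factor?

EDIT: I should read more before posting... Of course the hash code cannot map directly to some internal index. It must be reduced to a value that fits the current capacity, meaning that the greater your initial capacity, the fewer hash collisions you can expect. Choosing an initial capacity exactly the size (or +1) of your object set with a load factor of 1 will indeed make sure that your map is never resized. However, it will kill your lookup and insertion performance. A resize is still relatively quick and would only occur maybe once, while lookups are done in pretty much any relevant work with the map. As a result, optimizing for quick lookups is what you really want here. You can combine that with never having to resize by doing as the JavaDoc says: take your required capacity, divide it by an optimal load factor (e.g. 0.75) and use that as the initial capacity, with that load factor. Add 1 to make sure rounding doesn't get you.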
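
The reduction of a hash code to a bucket index can be illustrated with a small sketch. It assumes a power-of-two table and keeps only the low bits of the hash; the additional bit-mixing that java.util.HashMap applies to hashCode() is deliberately left out, and the class BucketIndexSketch, its methods, and the sample hash codes are invented for this example.

```java
import java.util.HashSet;
import java.util.Set;

public class BucketIndexSketch {

    // Simplified index calculation for a power-of-two table: keep the low bits.
    // Assumption: HashMap's extra mixing of the hash code is omitted here.
    static int indexFor(int hash, int tableLength) {
        return hash & (tableLength - 1);
    }

    // Number of distinct buckets hit by the given hash codes.
    static int bucketsUsed(int[] hashes, int tableLength) {
        Set<Integer> used = new HashSet<Integer>();
        for (int h : hashes)
            used.add(indexFor(h, tableLength));
        return used.size();
    }

    public static void main(String[] args) {
        // hypothetical hash codes that share their low 7 bits
        int[] hashes = {5, 133, 261, 389};
        System.out.println(bucketsUsed(hashes, 128)); // prints 1: all four collide
        System.out.println(bucketsUsed(hashes, 256)); // prints 2: one more bit tells them apart
    }
}
```

A larger table uses more bits of each hash code, so keys that collide in a small table may be separated in a bigger one. How much this matters in practice depends on the key distribution, which is what the benchmarks in this thread try to measure.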

What is the optimal capacity and load factor for a fixed-size HashMap?
