HashMap slower after deserialization - why?

Asked: 2015-02-15 12:17:29

Tags: java hashmap deserialization

I have a pretty large HashMap (~250 MB). Creating it takes about 50-55 seconds, so I decided to serialize it and save it to a file. Reading it back from the file takes about 16-17 seconds.

The only problem is that lookups seem to be slower this way. I always thought the hashmap is read from the file into memory, so performance should be the same as when I create the hashmap myself, right? Here is the code I am using to read the hashmap from the file:

File file = new File("omaha.ser");
FileInputStream f = new FileInputStream(file);
ObjectInputStream s = new ObjectInputStream(new BufferedInputStream(f));
omahaMap = (HashMap<Long, Integer>) s.readObject();
s.close();
When I create the hashmap myself, 300 million lookups take about 3.1 seconds, but when I read the same hashmap from the file they take about 8.5 seconds. Does anybody know why? Am I overlooking something obvious?

Edit:

I only "measured" the time by taking it with System.nanoTime(), so no proper benchmarking method was used. Here is the code:

public class HandEvaluationTest
{
    public static void Test()
    {

        HandEvaluation.populate5Card();
        HandEvaluation.populate9CardOmaha();


        Card[] player1cards = {new Card("4s"), new Card("2s"), new Card("8h"), new Card("4d")};
        Card[] player2cards = {new Card("As"), new Card("9s"), new Card("6c"), new Card("2h")};
        Card[] player3cards = {new Card("9h"), new Card("7h"), new Card("Kc"), new Card("Kh")};
        Card[] table = {new Card("2d"), new Card("2c"), new Card("3c"), new Card("5c"), new Card("4h")};


        int j=0, k=0, l=0;
        long startTime = System.nanoTime();
        for(int p=0; p<100000000; p++) {
            j = HandEvaluation.handEval9Hash(player1cards, table);
            k = HandEvaluation.handEval9Hash(player2cards, table);
            l = HandEvaluation.handEval9Hash(player3cards, table);
        }
        long estimatedTime = System.nanoTime() - startTime;
        System.out.println("Time needed: " + estimatedTime*Math.pow(10,-6) + "ms");
        System.out.println("Handstrength Player 1: " + j);
        System.out.println("Handstrength Player 2: " + k);
        System.out.println("Handstrength Player 3: " + l);
    }
}

The big hashmap work is done in HandEvaluation.populate9CardOmaha(). The 5-card one is small. The code for the big one:

 public static void populate9CardOmaha()
        {

            //Check if the hashmap is already there- then just read it and exit
            File hashmap = new File("omaha.ser");
            if(hashmap.exists())
            {
                try
                {
                    File file = new File("omaha.ser");
                    FileInputStream f = new FileInputStream(file);
                    ObjectInputStream s = new ObjectInputStream(new BufferedInputStream(f));
                    omahaMap = (HashMap<Long, Integer>) s.readObject();
                    s.close();
                }
                catch(IOException ioex) {ioex.printStackTrace();}
                catch(ClassNotFoundException cnfex)
                {
                    System.out.println("Class not found");
                    cnfex.printStackTrace();
                    return;
                }
                return;
            }

            // if it's not there, populate it yourself
            ... Code for populating hashmap ...
            // and then save it to file

            try
            {
                File file = new File("omaha.ser");
                FileOutputStream f = new FileOutputStream(file);
                ObjectOutputStream s = new ObjectOutputStream(new BufferedOutputStream(f));
                s.writeObject(omahaMap);
                s.close();
            }
            catch(IOException ioex) {ioex.printStackTrace();}
        }

When I populate it myself (= the file is not there), the lookups in HandEvaluationTest.Test() take about 8 seconds instead of 3. Maybe it is just my very naive way of measuring the elapsed time?
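
For what it is worth, here is a rough sketch of how the timing could be made slightly less naive (just an idea, not a proper benchmark like JMH): an untimed warm-up pass and a GC request before starting the clock, so JIT warm-up and collections are less likely to land inside the measured loop. This would go in place of the timed section in Test() above:

        // Warm-up pass (untimed) so the JIT has compiled the hot path
        for(int p=0; p<1000000; p++) {
            HandEvaluation.handEval9Hash(player1cards, table);
        }
        System.gc();    // request a collection before the timed section

        long startTime = System.nanoTime();
        for(int p=0; p<100000000; p++) {
            j = HandEvaluation.handEval9Hash(player1cards, table);
            k = HandEvaluation.handEval9Hash(player2cards, table);
            l = HandEvaluation.handEval9Hash(player3cards, table);
        }
        long elapsedMs = (System.nanoTime() - startTime) / 1000000;
        System.out.println("Time needed: " + elapsedMs + "ms");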

2 Answers:

Answer 0 (score: 3)

This question was interesting, so I wrote my own test case to verify it. I found no difference in lookup speed between a live map and one loaded from a serialized file. The program is at the end of this post for anyone who wants to run it.

  • Methods were monitored with JProfiler.
  • The serialized file is comparable to yours, ~230 MB.
  • Lookups against the in-memory map cost 1210 ms, without any serialization involved.

(screenshot: JProfiler lookup timings for the in-memory map)

  • After serializing the map and reading it back in, the lookup cost stayed the same (well, almost - 1224 ms).

(screenshot: JProfiler lookup timings after serialization and deserialization)

  • The profiler was tuned to add minimal overhead in both scenarios.
  • This was measured on Java(TM) SE Runtime Environment (build 1.6.0_25-b06) / 4 CPUs running at 1.7 GHz / 4 GB RAM at 800 MHz.

Measuring is tricky. I did notice the 8 second lookup time you describe myself, but guess what else I noticed when that happened.

GC activity

(screenshot: JProfiler GC activity during the timed lookups)

Your measurements are probably picking up the same thing. If you isolate the measurement to Map.get() alone, you will see that the results are comparable.
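
One cheap way to check whether GC is what skews your numbers (a suggestion, not something I verified on your machine) is to run the test with GC logging switched on and see whether collections coincide with the timed loop. On a Java 6/7/8 JVM that would look something like the following, where YourMainClass is a placeholder for whatever class launches the test:

java -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps YourMainClass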


public class GenericTest
{
    public static void main(String... args)
    {
        // Call the methods as you please for a live Vs ser <-> de_ser run
    }

    private static Map<Long, Integer> generateHashMap()
    {
        Map<Long, Integer> map = new HashMap<Long, Integer>();
        final Random random = new Random();
        for(int counter = 0 ; counter < 10000000 ; counter++)
        {
            final int value = random.nextInt();
            final long key = random.nextLong();
            map.put(key, value);
        }
        return map;
    }

    private static void lookupItems(int n, Map<Long, Integer> map)
    {
        final Random random = new Random();
        for(int counter = 0 ; counter < n ; counter++)
        {
            final long key = random.nextLong();
            final Integer value = map.get(key);
        }
    }

    private static void serialize(Map<Long, Integer> map)
    {
        try
        {
            File file = new File("temp/omaha.ser");
            FileOutputStream f = new FileOutputStream(file);
            ObjectOutputStream s = new ObjectOutputStream(new BufferedOutputStream(f));
            s.writeObject(map);
            s.close();
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
    }

    private static Map<Long, Integer> deserialize()
    {
        try
        {
            File file = new File("temp/omaha.ser");
            FileInputStream f = new FileInputStream(file);
            ObjectInputStream s = new ObjectInputStream(new BufferedInputStream(f));
            HashMap<Long, Integer> map = (HashMap<Long, Integer>) s.readObject();
            s.close();
            return map;
        }
        catch (Exception e)
        {
            throw new RuntimeException(e);
        }
    }
}
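
main() above is intentionally left empty. Here is one way to drive it for a live vs ser <-> de_ser comparison (my own sketch: the lookup count is arbitrary, only the lookupItems() calls are timed, and serialize()/deserialize() above expect a "temp" directory to exist next to the program):

    public static void main(String... args)
    {
        final int lookups = 100000000; // arbitrary; pick whatever matches your test

        // Live run: build the map and time lookups against it directly.
        Map<Long, Integer> liveMap = generateHashMap();
        long t0 = System.nanoTime();
        lookupItems(lookups, liveMap);
        System.out.println("live lookups: " + (System.nanoTime() - t0) / 1000000 + " ms");

        // Ser <-> de_ser run: write the map out, read it back, time lookups again.
        serialize(liveMap);
        Map<Long, Integer> restoredMap = deserialize();
        long t1 = System.nanoTime();
        lookupItems(lookups, restoredMap);
        System.out.println("deserialized lookups: " + (System.nanoTime() - t1) / 1000000 + " ms");
    }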

Answer 1 (score: 2)

  When I create the hashmap myself, 300 million lookups take about 3.1 seconds, but when I read the same hashmap from the file they take about 8.5 seconds. Does anybody know why? Am I overlooking something obvious?

One possible cause is that the reconstructed HashMap might not have the same capacity (number of buckets) as the original one, which may increase the frequency of hash collisions or (if the size is increased) decrease the locality of main memory accesses (resulting in more cache misses). To verify, use a debugger to inspect the length of map.table before and after reconstruction. If that is indeed the case, try copying the data into a new HashMap with an appropriate loadFactor.
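
As a rough illustration of both checks (my own sketch, not code from the question): the bucket count can also be read via reflection instead of a debugger, which relies on the private table field of OpenJDK's HashMap and may break on other JDKs or versions, and if the counts differ the entries can be copied into a fresh HashMap sized explicitly:

import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.Map;

public class CapacityCheck
{
    // Reads HashMap.table.length via reflection. "table" is a private,
    // transient field of OpenJDK's HashMap, so this is only meant as a
    // quick sanity check, not production code.
    static int bucketCount(HashMap<?, ?> map) throws Exception
    {
        Field tableField = HashMap.class.getDeclaredField("table");
        tableField.setAccessible(true);
        Object[] table = (Object[]) tableField.get(map);
        return table == null ? 0 : table.length;
    }

    // Copies the entries into a new HashMap sized up front, so the copy
    // has a predictable capacity and does not resize during putAll().
    static <K, V> HashMap<K, V> rebuild(Map<K, V> source, float loadFactor)
    {
        int initialCapacity = (int) (source.size() / loadFactor) + 1;
        HashMap<K, V> copy = new HashMap<K, V>(initialCapacity, loadFactor);
        copy.putAll(source);
        return copy;
    }
}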

As to why serialization does not preserve the capacity: HashMap customizes its serialization format with its own writeObject and readObject methods (it would make no sense to serialize null for every empty table element), and it ignores the capacity it finds in the input stream:

/**
 * Reconstitute the {@code HashMap} instance from a stream (i.e.,
 * deserialize it).
 */
private void readObject(java.io.ObjectInputStream s)
    throws IOException, ClassNotFoundException {
    // Read in the threshold (ignored), loadfactor, and any hidden stuff
    s.defaultReadObject();
    reinitialize();
    if (loadFactor <= 0 || Float.isNaN(loadFactor))
        throw new InvalidObjectException("Illegal load factor: " +
                                         loadFactor);
    s.readInt();                // Read and ignore number of buckets
    int mappings = s.readInt(); // Read number of mappings (size)
    if (mappings < 0)
        throw new InvalidObjectException("Illegal mappings count: " +
                                         mappings);
    else if (mappings > 0) { // (if zero, use defaults)
        // Size the table using given load factor only if within
        // range of 0.25...4.0
        float lf = Math.min(Math.max(0.25f, loadFactor), 4.0f);
        float fc = (float)mappings / lf + 1.0f;
        int cap = ((fc < DEFAULT_INITIAL_CAPACITY) ?
                   DEFAULT_INITIAL_CAPACITY :
                   (fc >= MAXIMUM_CAPACITY) ?
                   MAXIMUM_CAPACITY :
                   tableSizeFor((int)fc));
        float ft = (float)cap * lf;
        threshold = ((cap < MAXIMUM_CAPACITY && ft < MAXIMUM_CAPACITY) ?
                     (int)ft : Integer.MAX_VALUE);
        @SuppressWarnings({"rawtypes","unchecked"})
            Node<K,V>[] tab = (Node<K,V>[])new Node[cap];
        table = tab;

        // Read the keys and values, and put the mappings in the HashMap
        for (int i = 0; i < mappings; i++) {
            @SuppressWarnings("unchecked")
                K key = (K) s.readObject();
            @SuppressWarnings("unchecked")
                V value = (V) s.readObject();
            putVal(hash(key), key, value, false, false);
        }
    }
}
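
To see what that means in practice, take some illustrative numbers (mine, not from the question): with 9,000,000 mappings and the default loadFactor of 0.75, fc = 9,000,000 / 0.75 + 1 ≈ 12,000,001, and tableSizeFor rounds that up to the next power of two, 2^24 = 16,777,216 buckets. If the original map had been created with, say, new HashMap<>(100000000), it would have been using 2^27 = 134,217,728 buckets, so the deserialized copy ends up with 8 times fewer buckets and correspondingly more collisions per bucket, even though it holds exactly the same entries.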

I suspect it ignores that number to prevent denial-of-service attacks where an attacker crafts a serialization stream that specifies an unrealistically high (or low) number of buckets, which would cause an OutOfMemoryError (or excessive CPU load due to hash collisions). That would be a cheap way to mount a denial-of-service attack against any application that accepts serialized data from untrusted sources (CVE-2012-2739 describes such an issue).