Question

Hello,

I'm currently working on word prediction in Java. For this I'm using an NGram-based model, but I'm running into memory issues...

At first, I had a model like this:

public class NGram implements Serializable {
    private static final long serialVersionUID = 1L;

    private transient int count;
    private int id;
    private NGram next;

    public NGram(int idP) {
        this.id = idP;
    }
}

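But this takes a lot of memory, so I thought I needed to optimize. My idea was: if I have "hello the world" and "hello the people", then instead of storing two ngrams I could keep a single one holding the shared beginning ("hello the") and give it two possible continuations: "people" and "world".

To be clearer, here is my new model: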

public class BNGram implements Serializable {
    private static final long serialVersionUID = 1L;
    private int id;
    private HashMap<Integer,BNGram> next;
    private int count = 1;

    public BNGram(int idP) {
        this.id = idP;
        this.next = new HashMap<Integer, BNGram>();
    }
}

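But it seems that my second model consumes twice as much memory... I think this is because of the HashMap, but I don't know how to reduce it. I tried different Map implementations, such as Trove and others, but it doesn't change anything.

To give you an idea: for a 9 MB text with 57818 distinct words (distinct words, not the total word count), my javaw process consumes 1.2 GB of memory after NGram generation... If I save the model with a GZIPOutputStream, it takes about 18 MB on disk.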
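The saving code is not shown here. One common way to write a Serializable model through a GZIPOutputStream, sketched under that assumption, looks roughly like this (GzipModelIO and its save/load methods are hypothetical names, not part of the original code):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper for illustration only.
class GzipModelIO {

    // Serializes an object graph and compresses it with GZIP.
    static byte[] save(Serializable model) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(new GZIPOutputStream(bytes))) {
            out.writeObject(model);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Closing the stream above has flushed the GZIP trailer into 'bytes'.
        return bytes.toByteArray();
    }

    // Decompresses and deserializes what save() produced.
    static Object load(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(
                new GZIPInputStream(new ByteArrayInputStream(data)))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ArrayList<String> sample = new ArrayList<>(Arrays.asList("hello", "the", "world"));
        byte[] compressed = save(sample);
        System.out.println(compressed.length + " bytes, restored: " + load(compressed));
    }
}
```

The compressed file size says little about heap usage: the serialized form carries no JVM object headers, references, or unused hash table slots, so a large gap between disk size and process memory is to be expected.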

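So my question is: how can I use less memory? Can I do something with compression (as with serialization)? I need to add this to another application, so I need to reduce the memory usage before...

Thanks a lot, and sorry for my bad English...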

ZiMath

Answer 1

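You need a specialized structure to achieve what you want.

Take a look at Apache's PatriciaTrie. It is like a Map, but it is memory-efficient and works with Strings. It is also very fast: operations are O(k), where k is the number of bits in the largest key.

It has an operation that suits your immediate needs: prefixMap(), which returns a SortedMap view of the trie containing the Strings that are prefixed by the given key.

A brief usage example: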

import java.util.SortedMap;

import org.apache.commons.collections4.trie.PatriciaTrie;

public class Patricia {

    public static void main(String[] args) {

        PatriciaTrie<String> trie = new PatriciaTrie<>();

        String world = "hello the world";
        String people = "hello the people";

        trie.put(world, null);
        trie.put(people, null);

        SortedMap<String, String> map1 = trie.prefixMap("hello");
        System.out.println(map1.keySet());  // [hello the people, hello the world]

        SortedMap<String, String> map2 = trie.prefixMap("hello the w");
        System.out.println(map2.keySet()); // [hello the world]

        SortedMap<String, String> map3 = trie.prefixMap("hello the p");
        System.out.println(map3.keySet());  // [hello the people]
    }
}

There are also the tests, which contain more examples.
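If adding Commons Collections is not an option, the same kind of prefix query can be approximated with the JDK alone. The following is only a sketch using java.util.TreeMap, which is a red-black tree and not a trie, so it does not share the memory characteristics described above; PrefixLookup and its prefixMap helper are hypothetical names:

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical helper for illustration only.
class PrefixLookup {

    // Returns a view of all entries whose key starts with the given prefix.
    // Assumption: no key contains the character '\uffff' right after the prefix.
    static <V> SortedMap<String, V> prefixMap(TreeMap<String, V> map, String prefix) {
        if (prefix.isEmpty()) {
            return map;
        }
        // Every key that starts with the prefix sorts between the prefix itself
        // (inclusive) and the prefix followed by the largest char (exclusive).
        return map.subMap(prefix, prefix + Character.MAX_VALUE);
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> counts = new TreeMap<>();
        counts.put("hello the world", 1);
        counts.put("hello the people", 1);

        System.out.println(prefixMap(counts, "hello").keySet());       // [hello the people, hello the world]
        System.out.println(prefixMap(counts, "hello the w").keySet()); // [hello the world]
    }
}
```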

Answer 2

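Here I am mainly trying to explain why you are observing such excessive memory consumption, and what you could do about it (if you want to stick with HashMap):

A HashMap created with the default constructor has an initial capacity of 16. This means it reserves space for 16 entries, even if it is empty. (Java 8 and later defer allocating this table until the first put, but from then on it still has 16 slots.) In addition, you seem to create the map regardless of whether it is needed.

The ways to reduce memory consumption in your case are:

Create the map only when necessary
Create it with a smaller initial capacity

Applied to your class, this could look roughly like this: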

import java.util.HashMap;
import java.util.Map;

public class BNGram {
    private int id;
    private Map<Integer,BNGram> next;

    public BNGram(int idP) {
        this.id = idP;
        // (Do not create a new `Map` here!)
    }

    void doSomethingWhereTheMapIsNeeded(Integer key, BNGram value) {

        // Create a map, when required, with an initial capacity of 1
        if (next == null) {
            next = new HashMap<Integer, BNGram>(1);
        }
        next.put(key, value);
    }
}
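To show how the lazily created map fits into building and querying the model, here is a minimal, self-contained sketch. The class name LazyNGramNode, the add/countOf methods, and the word ids (hello=0, the=1, world=2, people=3) are assumptions made for illustration; the original post does not show how n-grams are inserted:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical variant of BNGram for illustration only.
class LazyNGramNode {
    private final int id;
    private int count;
    private Map<Integer, LazyNGramNode> next; // stays null until a child is added

    LazyNGramNode(int id) {
        this.id = id;
    }

    int id() {
        return id;
    }

    // Records one occurrence of the given word-id sequence below this node.
    void add(int... wordIds) {
        LazyNGramNode node = this;
        for (int wordId : wordIds) {
            if (node.next == null) {
                node.next = new HashMap<>(1); // small initial capacity, created on demand
            }
            LazyNGramNode child = node.next.get(wordId);
            if (child == null) {
                child = new LazyNGramNode(wordId);
                node.next.put(wordId, child);
            }
            child.count++;
            node = child;
        }
    }

    // Returns how often the sequence was recorded, or 0 if it was never seen.
    int countOf(int... wordIds) {
        LazyNGramNode node = this;
        for (int wordId : wordIds) {
            if (node.next == null) {
                return 0;
            }
            node = node.next.get(wordId);
            if (node == null) {
                return 0;
            }
        }
        return node.count;
    }

    boolean isLeaf() {
        return next == null;
    }

    public static void main(String[] args) {
        LazyNGramNode root = new LazyNGramNode(-1);
        root.add(0, 1, 2); // "hello the world"
        root.add(0, 1, 3); // "hello the people"
        System.out.println(root.countOf(0, 1));    // 2
        System.out.println(root.countOf(0, 1, 2)); // 1
    }
}
```

Leaf nodes, which typically make up a large share of such a tree, never allocate a map at all in this version.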

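But...

...conceptually, it is questionable to have a large "tree" structure consisting of many, many maps, each with only a "few" entries. This suggests that a different data structure is more appropriate here. So you should definitely prefer the solution from the answer by Magnamag, or (if that is not applicable for you, as suggested in your comment) look for an alternative data structure, and perhaps even phrase this as a new question that does not suffer from the XY Problem.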
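As one example of what such an alternative could look like (this is not taken from either answer, just a sketch under my own assumptions), each node can keep its children in two parallel arrays sorted by word id and find them with binary search, avoiding one HashMap plus boxed Integer keys per node. ArrayNGramNode and its methods are hypothetical names, and whether this actually saves memory for a given corpus would have to be measured:

```java
import java.util.Arrays;

// Hypothetical array-based node for illustration only.
class ArrayNGramNode {
    // Shared empty arrays, so nodes without children allocate nothing extra.
    private static final int[] NO_IDS = new int[0];
    private static final ArrayNGramNode[] NO_CHILDREN = new ArrayNGramNode[0];

    private int count;
    private int[] childIds = NO_IDS;                 // word ids, kept sorted
    private ArrayNGramNode[] children = NO_CHILDREN; // children[i] belongs to childIds[i]

    // Finds the child for wordId, inserting a new one at its sorted position if absent.
    private ArrayNGramNode childOrCreate(int wordId) {
        int pos = Arrays.binarySearch(childIds, wordId);
        if (pos >= 0) {
            return children[pos];
        }
        int insertAt = -(pos + 1);
        int n = childIds.length;
        int[] ids = new int[n + 1];
        ArrayNGramNode[] nodes = new ArrayNGramNode[n + 1];
        System.arraycopy(childIds, 0, ids, 0, insertAt);
        System.arraycopy(children, 0, nodes, 0, insertAt);
        ids[insertAt] = wordId;
        nodes[insertAt] = new ArrayNGramNode();
        System.arraycopy(childIds, insertAt, ids, insertAt + 1, n - insertAt);
        System.arraycopy(children, insertAt, nodes, insertAt + 1, n - insertAt);
        childIds = ids;
        children = nodes;
        return nodes[insertAt];
    }

    // Records one occurrence of the given word-id sequence below this node.
    void add(int... wordIds) {
        ArrayNGramNode node = this;
        for (int wordId : wordIds) {
            node = node.childOrCreate(wordId);
            node.count++;
        }
    }

    // Returns how often the sequence was recorded, or 0 if it was never seen.
    int countOf(int... wordIds) {
        ArrayNGramNode node = this;
        for (int wordId : wordIds) {
            int pos = Arrays.binarySearch(node.childIds, wordId);
            if (pos < 0) {
                return 0;
            }
            node = node.children[pos];
        }
        return node.count;
    }

    public static void main(String[] args) {
        ArrayNGramNode root = new ArrayNGramNode();
        root.add(0, 1, 2); // "hello the world", assuming hello=0, the=1, world=2
        root.add(0, 1, 3); // "hello the people", assuming people=3
        System.out.println(root.countOf(0, 1)); // 2
    }
}
```

The trade-off is that adding a new child copies the node's arrays, so insertion costs time linear in the number of children, while lookups stay logarithmic.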

A large number of Objects in Java (using HashMap)
