Question

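I am trying to implement a trie in Java that holds 203,675 words, for use in a text editor.

Previously I stored the words in an ArrayList, which took up 90 MB. I wanted to use a trie to reduce the space consumption.

This is what I have so far, but the space consumption is now 250 MB. What is causing the increase?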

package TextEditor;

import java.io.*;
import java.util.*;
import javax.swing.JOptionPane;

class Vertex {
    int words;
    Map<Character, Vertex> child;
    public Vertex() {
        words = 0;
        child = new HashMap<>();
    }
}
class Trie {
    private Vertex root;
    private InputStream openFile;
    private OutputStream openWriteFile;
    private BufferedReader readFile;
    private BufferedWriter writeFile;
    public Trie() {
        root = new Vertex();
    }
    public Trie(String path) {
        try {
            root = new Vertex();
            openFile = getClass().getResourceAsStream(path);
            readFile = new BufferedReader(new InputStreamReader(openFile));
            // Loop until readLine() returns null so the last line is inserted too.
            String in;
            while ((in = readFile.readLine()) != null) {
                this.insert(in);
            }
        } catch (IOException ex) {
            JOptionPane.showMessageDialog(null,
                "TRIE CONSTRUCTION ERROR!!!!");
        }
    }
    private void addWord(Vertex vertex, String s, int i) {
        try {
            if (i >= s.length()) {
                vertex.words += 1;
                return;
            }
            char ind = s.charAt(i);
            if (!vertex.child.containsKey(ind)) {
                vertex.child.put(ind, new Vertex());
            }
            addWord(vertex.child.get(ind), s, i + 1);
        } catch (Exception e) {
            e.printStackTrace();
            System.exit(1);
        }
    }
    final void insert(String s) {
        addWord(root, s.toLowerCase(), 0);
    }
    private void DFS(Vertex v, String s, ArrayList<String> list,
        boolean store, String startsWith, int ind) {
        if (v != null && v.words != 0) {
            if (!store) {
                System.out.println(s);
            } else if (startsWith == null || s.length() >= startsWith.length()) {
                // A null prefix means "collect every word".
                list.add(s);
            }
        }
        for (Map.Entry<Character, Vertex> entry : v.child.entrySet()) {
            char c = entry.getKey();
            if ((startsWith == null) || (ind >= startsWith.length()) ||
                (startsWith.charAt(ind) == c)) {
                DFS(entry.getValue(), s + c, list, store, startsWith, ind + 1);
            }
        }
    }
    public void Print() {
        DFS(root, "", null, false, null, 0);
    }
    ArrayList<String> getAsList(String startsWith) {
        ArrayList<String> ret = new ArrayList<>();
        DFS(root, "", ret, true, startsWith, 0);
        return ret;
    }
    int count(Vertex vertex, String s, int i) {
        if (i >= s.length()) {
            return vertex.words;
        }
        if (!vertex.child.containsKey(s.charAt(i))) {
            return 0;
        }
        return count(vertex.child.get(s.charAt(i)), s, i + 1);
    }
    int count(String s) {   
        return count(root, s, 0);
    }
}

Is there a working example of a trie structure that I could use?

Answer 1

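Your use of the word "space" is ambiguous. From your description, it sounds like you are talking about the heap. If so, the reason for the increased memory usage is that a data structure like a trie takes up extra memory to store the references between its nodes. An ArrayList just packs everything in, one String reference after another, with no additional information beyond the backing array. A trie has far more bookkeeping to do in order to record the relationships between all of its nodes.

In particular, the HashMap in each vertex is going to be very expensive; by default the Sun implementation allocates enough space for a 16-entry map, and that requires storage for the map's own memory allocation record, the hashCodes (32-bit ints, not chars), the object wrapper for each Character...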
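To make the bookkeeping concrete, here is a minimal sketch that counts how many vertices (and therefore how many HashMap instances) a trie of the question's shape creates. The names CountingVertex and NodeCounter are hypothetical stand-ins, not part of the original code, and the sketch makes no claim about bytes per node, since that depends on the JVM.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the question's Vertex: one HashMap per node.
class CountingVertex {
    int words;
    final Map<Character, CountingVertex> child = new HashMap<>();
}

class NodeCounter {
    // Inserts a word, creating one vertex (and one HashMap) per new character.
    static void insert(CountingVertex root, String word) {
        CountingVertex v = root;
        for (int i = 0; i < word.length(); i++) {
            v = v.child.computeIfAbsent(word.charAt(i), k -> new CountingVertex());
        }
        v.words++;
    }

    // Counts every vertex reachable from v, including v itself.
    static int countVertices(CountingVertex v) {
        int total = 1;
        for (CountingVertex c : v.child.values()) {
            total += countVertices(c);
        }
        return total;
    }

    public static void main(String[] args) {
        CountingVertex root = new CountingVertex();
        for (String w : new String[] {"car", "cat", "dog"}) {
            insert(root, w);
        }
        // Three words need 8 vertices: root, c, a, r, t, d, o, g.
        System.out.println(countVertices(root)); // prints 8
    }
}
```

An ArrayList would hold these three words with three references in one array; the trie holds them with eight objects, each carrying its own map. Running the same count over the real word list would show how many maps the 250 MB is being spread across.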

Answer 2

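First of all, separate the data structure (your trie) from any code that populates it. It only needs to hold the data in a structured form and provide some basic functionality, and that's it. Populating it should happen outside the data structure itself, so that you can handle the streams properly. There is not a single good reason to let your trie populate itself by taking a path as an argument. To clarify my first point about pulling the population out of the trie: at the moment the streams eat up a lot of memory inside the trie, because they are held in private variables and are never closed or destroyed. That means you keep the file loaded in memory on top of the populated data structure. Otherwise garbage collection could clean those items up, just as it does when you use the ArrayList.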
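One way to pull the loading out of the trie could look like the sketch below. WordLoader and its load method are hypothetical names introduced for illustration; the sink parameter stands for whatever method adds a word to your structure (for example a method reference to the trie's insert method).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper: reads one word per line and hands each word to a sink.
// The data structure itself never sees the stream.
class WordLoader {
    static int load(Reader source, Consumer<String> sink) throws IOException {
        int count = 0;
        // try-with-resources closes the reader even if reading fails.
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.isEmpty()) {
                    sink.accept(line);
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // A List stands in for the trie here; any Consumer<String> works.
        List<String> words = new ArrayList<>();
        int n = load(new StringReader("apple\nbanana\ncherry"), words::add);
        System.out.println(n + " " + words);
    }
}
```

Because the reader is a local variable inside the try block, nothing stream-related stays reachable once loading has finished.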

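Please don't reinvent the wheel; use a basic implementation like the one below. Get it working with this basic setup first, and worry about improving it later.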

public class Trie {

    private Map<String, Node> roots = new HashMap<>();

    public Trie() {}

    public Trie(List<String> argInitialWords) {
            for (String word:argInitialWords) {
                    addWord(word);
            }
    }

    public void addWord(String argWord) {
            addWord(argWord.toCharArray());
    }

    public void addWord(char[] argWord) {
            // An empty word has no first character to index the roots by, so ignore it.
            if (argWord.length == 0) {
                    return;
            }

            String first = Character.toString(argWord[0]);

            if (!roots.containsKey(first)) {
                    roots.put(first, new Node(argWord[0], first));
            }

            Node currentNode = roots.get(first);

            for (int i = 1; i < argWord.length; i++) {
                    if (currentNode.getChild(argWord[i]) == null) {
                            currentNode.addChild(new Node(argWord[i], currentNode.getValue() + argWord[i]));
                    }

                    currentNode = currentNode.getChild(argWord[i]);
            }

            currentNode.setIsWord(true);
    }

    public boolean containsPrefix(String argPrefix) {
            return contains(argPrefix.toCharArray(), false);
    }

    public boolean containsWord(String argWord) {
            return contains(argWord.toCharArray(), true);
    }

    public Node getWord(String argString) {
            Node node = getNode(argString.toCharArray());
            return node != null && node.isWord() ? node : null;
    }

    public Node getPrefix(String argString) {
            return getNode(argString.toCharArray());
    }

    @Override
    public String toString() {
            return roots.toString();
    }

    private boolean contains(char[] argString, boolean argIsWord) {
            Node node = getNode(argString);
            return (node != null && node.isWord() && argIsWord) || (!argIsWord && node != null);
    }

    private Node getNode(char[] argString) {
            // An empty query matches no node.
            if (argString.length == 0) {
                    return null;
            }

            Node currentNode = roots.get(Character.toString(argString[0]));

            // Walk down one character at a time; stop as soon as a child is missing.
            for (int i = 1; i < argString.length && currentNode != null; i++) {
                    currentNode = currentNode.getChild(argString[i]);
            }

            return currentNode;
    }
}

public class Node {

    private final Character ch;
    private final String value;
    private Map<String, Node> children = new HashMap<>();
    private boolean isValidWord;

    public Node(char argChar, String argValue) {
            ch = argChar;
            value = argValue;
    }

    public boolean addChild(Node argChild) {
            if (children.containsKey(Character.toString(argChild.getChar()))) {
                    return false;
            }

            children.put(Character.toString(argChild.getChar()), argChild);
            return true;
    }

    public boolean containsChildValue(char c) {
            return children.containsKey(Character.toString(c));
    }

    public String getValue() {
            return value.toString();
    }

    public char getChar() {
            return ch;
    }

    public Node getChild(char c) {
            return children.get(Character.toString(c));
    }

    public boolean isWord() {
            return isValidWord;
    }

    public void setIsWord(boolean argIsWord) {
            isValidWord = argIsWord;
    }

    @Override
    public String toString() {
            return value;
    }

}

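If you are looking at improving memory usage (at the cost of performance), you can do so in the following ways, separately or combined:

By switching the Character object to its primitive char form. This saves the overhead of the bytes used for the object and any internal private variables.
By switching the value parameter of Node to type char[]. This saves another String object in each node.
By implementing trie compression and merging common branches. This removes the need for a bunch of nodes. How many nodes are saved depends on the actual entries and on how similar the words are to one another. The more similar the words are, the less the trie can be compressed and the fewer nodes are saved, so less memory is freed.
By switching the HashMap to a more memory-friendly implementation (at the cost of lookup and insertion speed). The most efficient choice is a data structure that takes up no more memory than is needed to hold the keys. For example, if a node is known to hold only 3 keys, an array of length 3 is the best fit for that node in terms of memory consumption. In practice, a sorted set with a low initial capacity should do better than a hash map in terms of memory consumption, because you don't need to keep the hash codes, while still being easier to insert into and search than an array.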
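The first and last suggestions above can be sketched together as follows. ArrayNode and ArrayTrie are hypothetical names, not part of the answer's code. Each node keeps its children in two parallel arrays sized exactly to the number of children, with primitive char keys kept sorted so that lookup is a binary search. Whether this actually beats a HashMap for your word list depends on the JVM and on the data, so treat it as a starting point to measure.

```java
import java.util.Arrays;

// Hypothetical compact node: no hash table and no boxed Character objects.
class ArrayNode {
    char[] keys = new char[0];           // sorted child characters
    ArrayNode[] kids = new ArrayNode[0]; // kids[i] belongs to keys[i]
    boolean isWord;

    ArrayNode child(char c) {
        int i = Arrays.binarySearch(keys, c);
        return i >= 0 ? kids[i] : null;
    }

    ArrayNode childOrCreate(char c) {
        int i = Arrays.binarySearch(keys, c);
        if (i >= 0) {
            return kids[i];
        }
        int at = -(i + 1); // insertion point that keeps keys sorted
        char[] newKeys = new char[keys.length + 1];
        ArrayNode[] newKids = new ArrayNode[kids.length + 1];
        System.arraycopy(keys, 0, newKeys, 0, at);
        System.arraycopy(kids, 0, newKids, 0, at);
        System.arraycopy(keys, at, newKeys, at + 1, keys.length - at);
        System.arraycopy(kids, at, newKids, at + 1, kids.length - at);
        newKeys[at] = c;
        newKids[at] = new ArrayNode();
        keys = newKeys;
        kids = newKids;
        return newKids[at];
    }
}

class ArrayTrie {
    private final ArrayNode root = new ArrayNode();

    void insert(String word) {
        ArrayNode node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.childOrCreate(word.charAt(i));
        }
        node.isWord = true;
    }

    // Returns the node reached by following s, or null if the path is missing.
    private ArrayNode find(String s) {
        ArrayNode node = root;
        for (int i = 0; i < s.length() && node != null; i++) {
            node = node.child(s.charAt(i));
        }
        return node;
    }

    boolean containsWord(String word) {
        ArrayNode n = find(word);
        return n != null && n.isWord;
    }

    boolean containsPrefix(String prefix) {
        return find(prefix) != null;
    }
}
```

The trade-off is the one named above: every insertion of a new child copies two small arrays, so building the trie is slower than with a hash map, in exchange for nodes that carry no unused slots.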

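In general, a well-implemented trie (and I stress well-implemented) should come out at roughly the same 90 MB of memory consumption for the same data set, although that depends entirely on the actual data.

If you manage to put together a word list in which most words are not a prefix of any other word, your memory usage will be far greater than with the ArrayList, because you need many more nodes to represent the same thing.

If you really want to save some memory on truly random data sets, you should take a look at burst tries; another viable option might be a Patricia trie.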
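As a rough illustration of the compression idea, here is a simplified radix (path-compressed) trie, which belongs to the same family as the Patricia trie. It is not a full Patricia or burst trie implementation, and RadixNode and RadixTrie are hypothetical names. Each edge carries a whole substring, so a chain of single-child nodes collapses into one node.

```java
import java.util.ArrayList;
import java.util.List;

class RadixNode {
    String label;   // characters on the edge leading to this node
    boolean isWord;
    final List<RadixNode> kids = new ArrayList<>();

    RadixNode(String label, boolean isWord) {
        this.label = label;
        this.isWord = isWord;
    }
}

class RadixTrie {
    final RadixNode root = new RadixNode("", false);

    private static int commonPrefix(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) {
            i++;
        }
        return i;
    }

    void insert(String word) {
        RadixNode node = root;
        String rest = word;
        while (!rest.isEmpty()) {
            RadixNode match = null;
            int k = 0;
            for (RadixNode c : node.kids) {
                k = commonPrefix(c.label, rest);
                if (k > 0) { match = c; break; }
            }
            if (match == null) {
                // No child shares a first character: hang the remainder on one edge.
                node.kids.add(new RadixNode(rest, true));
                return;
            }
            if (k < match.label.length()) {
                // Split the edge: match keeps the shared part, tail keeps the remainder.
                RadixNode tail = new RadixNode(match.label.substring(k), match.isWord);
                tail.kids.addAll(match.kids);
                match.kids.clear();
                match.kids.add(tail);
                match.label = match.label.substring(0, k);
                match.isWord = false;
            }
            node = match;
            rest = rest.substring(k);
        }
        node.isWord = true;
    }

    boolean containsWord(String word) {
        RadixNode node = root;
        String rest = word;
        while (!rest.isEmpty()) {
            RadixNode next = null;
            for (RadixNode c : node.kids) {
                if (rest.startsWith(c.label)) { next = c; break; }
            }
            if (next == null) {
                return false;
            }
            rest = rest.substring(next.label.length());
            node = next;
        }
        return node.isWord;
    }

    static int countNodes(RadixNode n) {
        int total = 1;
        for (RadixNode c : n.kids) {
            total += countNodes(c);
        }
        return total;
    }

    public static void main(String[] args) {
        RadixTrie t = new RadixTrie();
        for (String w : new String[] {"test", "team", "tester"}) {
            t.insert(w);
        }
        // Nodes: root, "te", "st", "am", "er". A one-character-per-node trie needs 9.
        System.out.println(countNodes(t.root)); // prints 5
    }
}
```

The saving comes from the long unshared tails of words, which is where a plain trie spends most of its nodes.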

Optimizing the space usage of a trie structure in a Java program
