Question

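I'm taking an algorithms course in which we have to implement LZW compression in Java. I decided to use a Trie data structure for this, and I have implemented the Trie and got it working, but it is very slow and achieves almost no compression.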

We are supposed to use 8-bit symbols and be able to compress any binary file.

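Given a file of about 4 MB (bible.txt), my codes array ends up with about 549,012 elements. When I write these elements to a file (one integer code per line), I end up with a 3.5 MB "compressed" file, so I have saved only 0.5 MB.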

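How can I make this program more efficient? I feel like I'm misunderstanding something fundamental here, or missing something obvious, but I can't figure out why this doesn't compress.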

(I got my test file bible.txt from this site: https://corpus.canterbury.ac.nz/descriptions/)

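I read the bytes from the binary file like this (reading each one as an int and casting it to char is necessary so that values of 0x80 and above are not negative):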

public String readFile(String path) throws IOException, FileNotFoundException {
    File file = new File(path);

    StringBuilder string = new StringBuilder();

    try (FileInputStream fileInputStream = new FileInputStream(file)) {
        int singleCharInt;
        char singleChar;
        while((singleCharInt = fileInputStream.read()) != -1) {
            singleChar = (char) singleCharInt;
            string.append(singleChar);
        }
    } 

    return string.toString();
}

My main method looks like this:

    public static void main(String args[]) throws FileNotFoundException, IOException {
        String bytes = new FileReader().readFile("/home/user/Code/Trie/bible.txt");

        ArrayList<Integer> codes = new Compress().compress(bytes);
    }

My Compress class looks like this:

public class Compress {

    private int code = 0;

    public ArrayList<Integer> compress(String data) {
        Trie trie = new Trie();

        // Initialize Trie Data Structure with alphabet (256 possible values with 8-bit
        // symbols)
        for (code = 0; code <= 255; code++) {
            trie.insert(Character.toString((char) code), code);
        }

        code++;

        String s = Character.toString(data.charAt(0));

        ArrayList<Integer> codes = new ArrayList<Integer>();

        for (int i = 1; i < data.length(); i++) {
            String c = Character.toString(data.charAt(i));

            if (trie.find(s + c) > 0) {
                s += c;
            } else {
                codes.add(trie.find(s));
                trie.insert(s + c, code);
                code++;
                s = c;
            }
        }

        codes.add(trie.find(s));

        return codes;
    }

}

My Trie class looks like this:

public class Trie {
    private TrieNode root;

    public Trie() {
        this.root = new TrieNode(false);
    }

    public void insert (String word, int code) {
        TrieNode current = root;

        for (char l: word.toCharArray()) {
            current = current.getChildren().computeIfAbsent(Character.toString(l), c -> new TrieNode(false));
        }
        current.setCode(code);
        current.setWordEnd(true);
    }

    public int find(String word) {
        TrieNode current = root;

        for (int i = 0 ; i < word.length(); i++) {
            char ch = word.charAt(i);

            TrieNode node = current.getChildren().get(Character.toString(ch));

            if (node == null) {
                return -1;
            }

            current = node;
        }

        return current.getCode();
    }
}

My TrieNode class looks like this:

public class TrieNode {
    private HashMap<String, TrieNode> children;
    private int code;
    private boolean wordEnd;

    public TrieNode(boolean wordEnd) {
        this.children = new HashMap<String, TrieNode>();
        this.wordEnd = wordEnd;
    }

    public HashMap<String, TrieNode> getChildren() {
        return this.children;
    }

    public void setWordEnd(boolean wordEnd) {
        this.wordEnd = wordEnd;
    }
    
    public boolean isWordEnd() {
        return this.wordEnd;
    }

    public int getCode() {
        return this.code;
    }

    public void setCode(int code) {
        this.code = code;
    }
}

Thanks for your time!

Answer 1

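What does this mean: "When I write these elements to a file (one integer code per line)"? Are you writing four bytes to the file for each code? Are you writing four bytes and a newline? Are you writing a decimal number and a newline?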

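In any case, all of those are wrong. You need to store the codes as bits. In the usual LZW implementation, the number of bits in a code starts at 9 and then increases as more codes are created. In the file, the codes might be, for example, 12 or 13 bits wide. The decoder knows from the data when the encoder increases the width, so it always knows how many bits to fetch for the next code. It makes sense to reset back to 9 bits every so often, which the encoder signals to the decoder.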
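To make the growing code width concrete, here is a minimal sketch of how an encoder might pick the width. The class name `CodeWidth`, the method `bitsFor`, and the rule "use just enough bits for the largest code assigned so far" are assumptions for illustration; the exact moment at which the width grows is a convention that your encoder and decoder have to share.

```java
/** Hypothetical helper: how many bits a variable-width LZW code needs. */
final class CodeWidth {
    // The 256 single-byte codes already use all 8-bit values, so start at 9.
    static final int MIN_BITS = 9;

    /**
     * Width in bits needed to represent every code below nextCode,
     * where nextCode is the next unassigned dictionary code.
     */
    static int bitsFor(int nextCode) {
        int largest = Math.max(nextCode - 1, 1);
        // 32 - numberOfLeadingZeros(x) is the bit length of x.
        int needed = 32 - Integer.numberOfLeadingZeros(largest);
        return Math.max(MIN_BITS, needed);
    }
}
```

With this rule, codes stay 9 bits wide until code 512 has been assigned, then become 10 bits wide, and so on. That is far smaller than a decimal number plus a newline per code.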

So how do you read and write bits in a file? You'll quickly find that there are no functions for that. You need to write them yourself.

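In short, you keep a bit buffer in an integer, use shift and OR operations to add bits to the buffer, and keep track of how many bits are in the buffer in another integer. For encoding, after adding bits to the buffer, you check whether it holds at least 8 bits. If so, you write one byte to the file and remove those 8 bits from the buffer. Repeat until fewer than 8 bits remain in the buffer.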

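At the end, you must take care to write the last few bits out as one more byte, and make sure you have thought about how the decoder knows when to stop decoding bits.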
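The encoding side described above could be sketched roughly as follows. The class name `BitWriter`, its method names, the most-significant-bit-first packing order, and the zero padding of the final byte are all assumptions chosen for illustration, not the only reasonable layout.

```java
import java.io.IOException;
import java.io.OutputStream;

/** Hypothetical bit-level writer; packs codes most-significant bit first. */
final class BitWriter implements AutoCloseable {
    private final OutputStream out;
    private long buffer = 0;   // pending bits sit in the low bitCount bits
    private int bitCount = 0;  // how many pending bits the buffer holds

    BitWriter(OutputStream out) {
        this.out = out;
    }

    /** Appends the low `width` bits of `code` (1 <= width <= 32). */
    void write(int code, int width) throws IOException {
        long mask = (1L << width) - 1;
        buffer = (buffer << width) | (code & mask);
        bitCount += width;
        // Emit whole bytes while at least 8 bits are pending.
        while (bitCount >= 8) {
            bitCount -= 8;
            out.write((int) (buffer >>> bitCount) & 0xFF);
        }
        buffer &= (1L << bitCount) - 1; // keep only the leftover bits
    }

    /** Pads the final partial byte with zero bits, then closes the stream. */
    @Override
    public void close() throws IOException {
        if (bitCount > 0) {
            out.write((int) (buffer << (8 - bitCount)) & 0xFF);
            buffer = 0;
            bitCount = 0;
        }
        out.close();
    }
}
```

A caller would wrap a `FileOutputStream` in it and call `write(code, width)` once per code. Because the padding is always shorter than 8 bits and every code is at least 9 bits wide, a decoder that stops when it cannot fill a whole code will not mistake the padding for data; an explicit end-of-data code would be another way to handle this.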

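The decoder side works the same way: read bytes from the input file and add 8 bits at a time to the buffer until you have enough bits to extract the next code.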
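A minimal sketch of that decoding side might look like this. The class name `BitReader`, the most-significant-bit-first order (which has to match whatever the encoder used), and the choice to return -1 once the input can no longer fill a whole code are assumptions for illustration.

```java
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical bit-level reader; expects codes packed most-significant bit first. */
final class BitReader {
    private final InputStream in;
    private long buffer = 0;   // unread bits sit in the low bitCount bits
    private int bitCount = 0;  // how many unread bits the buffer holds

    BitReader(InputStream in) {
        this.in = in;
    }

    /**
     * Returns the next `width`-bit code (1 <= width <= 31),
     * or -1 if the input ends before a whole code is available.
     */
    int read(int width) throws IOException {
        // Pull in 8 bits at a time until a whole code is buffered.
        while (bitCount < width) {
            int next = in.read();
            if (next == -1) {
                return -1; // any leftover bits are treated as padding
            }
            buffer = (buffer << 8) | next;
            bitCount += 8;
        }
        bitCount -= width;
        int code = (int) ((buffer >>> bitCount) & ((1L << width) - 1));
        buffer &= (1L << bitCount) - 1; // drop the bits just consumed
        return code;
    }
}
```

The decoder would call `read(width)` in a loop, updating `width` by the same rule the encoder uses, and stop when it gets -1.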

My LZW compression program barely compresses anything
