Question

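So I've been experimenting with Huffman decoding, and I have this working function, but its time and space complexity are horrible. What I do so far is read each byte, extract each bit, and append it to currentBitString. I then reverse that string and append it to one giant string that essentially holds all of the file's data as bits. After that, I walk through the giant string looking for Huffman codes, and whenever one matches, I write the corresponding byte to the file. This code takes about 60 seconds to decode a 200 KB file, which is terrible, but I'm not sure how to improve it. I know that, for starters, I could write more than one byte at a time to the file, but that didn't seem to improve the time when I tried it?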

    public static void decode(File f) throws Exception {

    BufferedInputStream fin = new BufferedInputStream(new FileInputStream(f));
    int i = f.getName().lastIndexOf('.');
    String extension=".txt"; // substring(0, i) below drops the dot, so it is included here
    String newFileName=f.getName().substring(0, i)+extension;
    File nf = new File(newFileName);
    BufferedOutputStream fw = new BufferedOutputStream(new FileOutputStream(nf));
    int c;
    byte bits;
    byte current;
    String currentBitString="";
    String bitString="";
    //read each byte from file, reverse it, add to giant bitString
    //reads ALL BYTES
    while( (c=fin.read())!=-1 ) {
        current=(byte) c;
        currentBitString="";
        bits=0;
        for(int q=0;q<8;q++) {
            bits=getBit(current,q);
            currentBitString+=bits;
        }
        StringBuilder bitStringReverse=new StringBuilder(currentBitString);
        bitString+=bitStringReverse.reverse().toString();
    }
    currentBitString="";
    boolean foundCode=false;
    for(int j=0;j<bitString.length();j++) {
        currentBitString+=bitString.charAt(j);
        for(int k=0;k<nodes.length;k++) {
            //nodes is an array of Huffman nodes which contains the byte
            //data and the Huffman code for each byte
            if(nodes[k].code.compareTo(currentBitString.trim())==0) {
                fw.write(nodes[k].data);    
                foundCode=true;
                break;
            }
        }
        if(foundCode) {
            currentBitString="";
            foundCode=false;
        }

    }
    fw.flush();
    fw.close();
    fin.close();

    }

Here is the getBit function:

    public static byte getBit(byte ID, int position) {
        // return the bit at the given position (0 = least significant) of the byte
        return (byte) ((ID >> position) & 1);
    }

Here are the data members of the HuffmanNode class (the nodes array is an array of HuffmanNodes):

       public class HuffmanNode{
       byte data;
       int repetitions;
       String code;
       HuffmanNode right;
       HuffmanNode left;
       }

Answer 1

You can replace the String concatenation with += by a StringBuilder. That allocates fewer objects and puts less load on the garbage collector.

int c;
StringBuilder bitString = new StringBuilder();
//read each byte from file, reverse it, add to giant bitString
//reads ALL BYTES
while ((c = fin.read()) != -1) {
    byte current = (byte) c;
    StringBuilder currentBitString = new StringBuilder();
    for (int q = 0; q < 8; q++) {
        byte bits = getBit(current, q);
        currentBitString.append(bits);
    }
    bitString.append(currentBitString.reverse());
}

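Instead of keeping the codes and data in the nodes array, you should use a HashMap here. You currently compare codes by iterating over the entire array until you find the right match. On average that is n/2 calls to String#equals() per item. With a HashMap you can reduce this to ~1.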

Populate the map with the codes as keys and the data as values.

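// requires java.util.HashMap and java.util.Map
Map<String, Integer> nodes = new HashMap<>();
// data is a byte in HuffmanNode; mask it to an int (0..255), because a byte is not auto-boxed to Integer
nodes.put(code, data & 0xFF);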

Look the data up in the map:

StringBuilder currentBitString = new StringBuilder();
for (int j = 0; j < bitString.length(); j++) {
    currentBitString.append(bitString.charAt(j));
    Integer data = nodes.get(currentBitString.toString());
    if (data != null) {
        // a complete code was matched: emit its byte and start collecting the next code
        fw.write(data);
        currentBitString.setLength(0);
    }
}
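Putting the map snippets above together with the question's node array, a minimal self-contained sketch might look like the following. The trimmed-down HuffmanNode, the class name MapDecodeSketch, and the buildTable and decode helpers are assumptions made for illustration, not code from the question; the bit string is still one character per bit, as in this answer.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

class MapDecodeSketch {
    // Trimmed-down stand-in for the question's HuffmanNode (only the fields used here).
    static final class HuffmanNode {
        final byte data;
        final String code;
        HuffmanNode(byte data, String code) { this.data = data; this.code = code; }
    }

    // Index the node array once, so that each later lookup is a single hash probe.
    static Map<String, Integer> buildTable(HuffmanNode[] nodes) {
        Map<String, Integer> table = new HashMap<>();
        for (HuffmanNode n : nodes) {
            table.put(n.code, n.data & 0xFF); // mask the byte to 0..255 before boxing
        }
        return table;
    }

    // Decodes a sequence of '0'/'1' characters; trailing bits that match no code are ignored.
    static byte[] decode(CharSequence bitString, Map<String, Integer> table) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        StringBuilder current = new StringBuilder();
        for (int j = 0; j < bitString.length(); j++) {
            current.append(bitString.charAt(j));
            Integer data = table.get(current.toString());
            if (data != null) {
                out.write(data);
                current.setLength(0);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        HuffmanNode[] nodes = {
            new HuffmanNode((byte) 'a', "0"),
            new HuffmanNode((byte) 'b', "10"),
            new HuffmanNode((byte) 'c', "11"),
        };
        byte[] decoded = decode("010110", buildTable(nodes));
        System.out.println(new String(decoded, StandardCharsets.US_ASCII)); // prints "abca"
    }
}
```

Each step still builds a String from the partial code, so this removes the linear scan over the nodes but not the per-bit string handling.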

Answer 2

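Don't read the entire thing into memory. Process the codes as you encounter them. Read enough bits to decode the next code, decode it, keep the unused bits for the following code, and repeat.
Don't use strings of characters to represent bits, with one bit per character. Use bits to represent bits. The shift, and, and or operators are what you should be using. You would have an integer as a bit buffer that holds all of the bits you need to decode the next code.
Don't search over all of the code lengths and, inside that, do a linear search of all of the codes to find your code! I would be hard pressed to come up with a slower approach. You should use a tree descent or a table lookup for decoding. If you generate a canonical Huffman code in the first place, there is a simple lookup approach that can be implemented. See puff.c for an example. The textbook approach (which is slower than what puff.c does) is to build the same Huffman tree on the receiving end and go down that tree bit by bit until you reach a symbol. Emit the symbol and repeat.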
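As a rough illustration of the textbook tree descent combined with a small bit buffer, here is a minimal sketch. The Node class, the class name TreeDecodeSketch, and the symbolCount parameter are assumptions made for this example: the question does not say how the end of the data is signalled, so the sketch assumes the number of encoded symbols is known (for instance from a header), and it assumes bits are consumed most significant first, matching the reversed bit strings in the question.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class TreeDecodeSketch {
    // Hypothetical minimal tree node: a leaf carries a symbol, an inner node has two children.
    static final class Node {
        final int symbol;     // 0..255 for a leaf, -1 for an inner node
        final Node zero, one; // children followed on a 0 bit and on a 1 bit
        Node(int symbol) { this.symbol = symbol; this.zero = null; this.one = null; }
        Node(Node zero, Node one) { this.symbol = -1; this.zero = zero; this.one = one; }
        boolean isLeaf() { return zero == null; }
    }

    // Decodes exactly symbolCount symbols, reading input bytes only as they are needed.
    static void decode(InputStream in, OutputStream out, Node root, long symbolCount)
            throws IOException {
        int bitBuffer = 0; // the byte currently being consumed
        int bitsLeft = 0;  // number of unread bits remaining in bitBuffer
        for (long n = 0; n < symbolCount; n++) {
            Node node = root;
            while (!node.isLeaf()) {
                if (bitsLeft == 0) {
                    bitBuffer = in.read();
                    if (bitBuffer < 0) throw new EOFException("input ended inside a code");
                    bitsLeft = 8;
                }
                bitsLeft--;
                int bit = (bitBuffer >> bitsLeft) & 1; // next bit, most significant first
                node = (bit == 0) ? node.zero : node.one;
            }
            out.write(node.symbol);
        }
    }

    // Convenience wrapper for in-memory data.
    static byte[] decode(byte[] input, Node root, long symbolCount) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            decode(new ByteArrayInputStream(input), out, root, symbolCount);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Codes: a = 0, b = 10, c = 11
        Node root = new Node(new Node('a'), new Node(new Node('b'), new Node('c')));
        // Bits 0 10 11 0 followed by two padding bits: 01011000 = 0x58
        byte[] decoded = decode(new byte[] {0x58}, root, 4);
        System.out.println(new String(decoded, StandardCharsets.US_ASCII)); // prints "abca"
    }
}
```

With files, wrapping the streams in BufferedInputStream and BufferedOutputStream, as the question already does, keeps the per-byte read and write calls cheap.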

You should be able to process 200K of compressed input in a few milliseconds on a single core of a modern processor.
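The canonical-code lookup mentioned in this answer can be sketched as follows. This is loosely modeled on the counting idea used by puff.c, not a port of it: the class name CanonicalDecodeSketch, the BitReader helper, the MAX_BITS limit of 15, and the most-significant-bit-first order are all assumptions chosen for illustration. The decoder needs only the code length of each symbol; codes of the same length are assumed to be assigned in increasing symbol order.

```java
class CanonicalDecodeSketch {
    static final int MAX_BITS = 15; // assumed upper bound on the code length

    final int[] count = new int[MAX_BITS + 1]; // count[len] = number of codes with that length
    final int[] symbol;                        // used symbols, ordered by length, then by value

    // lengths[s] is the code length of symbol s, or 0 if s is unused.
    CanonicalDecodeSketch(int[] lengths) {
        for (int len : lengths) count[len]++;
        int[] offset = new int[MAX_BITS + 2]; // offset[len] = index of first symbol of that length
        for (int len = 1; len <= MAX_BITS; len++) offset[len + 1] = offset[len] + count[len];
        symbol = new int[offset[MAX_BITS + 1]];
        for (int s = 0; s < lengths.length; s++) {
            if (lengths[s] != 0) symbol[offset[lengths[s]]++] = s;
        }
    }

    // Hypothetical helper that hands out bits of a byte array, most significant bit first.
    static final class BitReader {
        private final byte[] data;
        private int pos; // index of the next bit to return
        BitReader(byte[] data) { this.data = data; }
        int nextBit() {
            if (pos >= data.length * 8) throw new IllegalStateException("out of input");
            int bit = (data[pos >> 3] >> (7 - (pos & 7))) & 1;
            pos++;
            return bit;
        }
    }

    // Decodes one symbol by comparing the bits read so far with the code range of each length.
    int decodeSymbol(BitReader in) {
        int code = 0;  // the bits read so far
        int first = 0; // first canonical code of the current length
        int index = 0; // position in symbol[] of the first symbol of the current length
        for (int len = 1; len <= MAX_BITS; len++) {
            code |= in.nextBit();
            int n = count[len];
            if (code - n < first) return symbol[index + (code - first)];
            index += n;
            first = (first + n) << 1;
            code <<= 1;
        }
        throw new IllegalStateException("invalid code");
    }

    public static void main(String[] args) {
        // Lengths 1, 2, 3, 3 give the canonical codes 0, 10, 110, 111 for symbols 0..3.
        CanonicalDecodeSketch table = new CanonicalDecodeSketch(new int[] {1, 2, 3, 3});
        // Bits 0 10 110 111 followed by padding: 01011011 10000000
        BitReader in = new BitReader(new byte[] {0x5B, (byte) 0x80});
        for (int i = 0; i < 4; i++) System.out.print(table.decodeSymbol(in) + " "); // prints "0 1 2 3 "
        System.out.println();
    }
}
```

The loop runs at most once per code bit and needs no tree and no per-code strings; the encoder has to assign codes in the same canonical order for this to work.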

How can I optimize Huffman decoding?
