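Java Huffman compressor output is larger than the original

Time: 2018-04-02 15:21:45

Tags: java huffman-code

I'm writing a Huffman compressor for a homework assignment. I managed to build the Huffman tree and the 0/1 codes for all the chars, but the output file is larger than the original file. There is a question like mine here, Unable to compress file during Huffman Encoding in Java, but I didn't understand it. My code: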

        this.HuffmanTreeBulid(); // create the Huffman tree
        HuffmanNode root = tree;
        this.codeGenerator(root, codes); // create the hash map of codes

        try 
        {
            FileOutputStream out2 = new FileOutputStream(fileOut); // for the new file
            FileInputStream in = new FileInputStream(fileInput); // for reading the original file again
            FileWriter out = new FileWriter(fileOut);
            //String code;
            char currentchar;
            int currentByte; // holds each byte read from the file
            if(!fileOut.exists()) // if the output file exists, replace it; otherwise create it
                fileOut.createNewFile();
            else
            {
                fileOut.delete();
                fileOut.createNewFile();
            }



            while((currentByte = in.read())!=-1)
            {
                int currentint =currentByte& 0xff;//"& 0xff" is for unsigned int 
                currentchar=(char)currentint;
                byte[] c=(huffmanCodes.get(currentchar)).getBytes();
                //out.write(huffmanCodes.get(code2));
                //out.write(huffmanCodes.get(currentchar));//for FileWriter
                out2.write(c);
            }
            in.close();
            out.close();
            out2.close();
        } 
        catch (IOException e) 
        {
                e.printStackTrace();
        }   

Update 1: I understand the problem now, so I did this:

         int bitIndex = 0;
            for (int i=0;i<codes.length();i++)
            {
                if(codes.charAt(i)=='1')
                    buffer.set(bitIndex++);
                else
                    buffer.clear(bitIndex++);
            }

It still doesn't work :(

Update 2: I did this to get the bytes from the string:

             byte[] bytes = new BigInteger(binaryString, 2).toByteArray();
                for (byte b : bytes) 
                {
                    out2.write(b);
                }

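It still doesn't work, but this is the closest I've gotten so far. Maybe the bytes are fine, but the way I write them is wrong?

2 answers:

Answer 0 (score: 2)

The problem is this line: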

 byte[] c=(huffmanCodes.get(currentchar)).getBytes();

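You are trying to write the encoded string out as raw bits and bytes. But getBytes() just returns the byte sequence of the string encoded in the platform's default charset. So you probably get the UTF-8 byte encoding of the character "1" and the UTF-8 byte encoding of the character "0", which means every bit of a Huffman code takes up a whole byte in the output. You have to parse the String into a byte. You can see how to do that here: java: convert binary string to int

or here: How to convert binary string to a byte?

You can read more about the getBytes method here: https://beginnersbook.com/2013/12/java-string-getbytes-method-example/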
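To make the size blow-up concrete, here is a small sketch. The code string "0101" is a made-up Huffman code, and UTF-8 is assumed explicitly (the no-argument getBytes() uses the platform default instead). It shows that the string is written as one byte per character, holding the character codes 48 and 49 rather than packed bits.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class GetBytesDemo {
    // Returns the bytes that out2.write(code.getBytes()) would put in the file.
    static byte[] asWritten(String code) {
        // UTF-8 is an assumption; the no-argument getBytes() uses the platform default.
        return code.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] written = asWritten("0101"); // hypothetical 4-bit Huffman code
        System.out.println(written.length);           // 4 bytes for 4 bits
        System.out.println(Arrays.toString(written)); // [48, 49, 48, 49]
    }
}
```

So a 4-bit code ends up costing 32 bits on disk, which is why the "compressed" file grows.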

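As @9000 said, you don't have a bitstream.

For a compressor, a bitstream is probably a better fit than working with whole bytes. Parsing each code into a complete byte won't compress your string, because a char would still take up the size of a char.

What you can do is concatenate the generated binary strings and then parse the whole string into bytes at the end. Watch out for the trailing zeros that pad the last byte.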
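A minimal sketch of that idea, under a few assumptions that are not in the answer itself: the names `BitStringPacker`, `pack` and `unpack` are hypothetical, bits are packed most-significant-bit first, and the first output byte records how many padding zeros were appended so that a decoder can discard them.

```java
public class BitStringPacker {
    // Packs a string of '0'/'1' characters into bytes, most significant bit first.
    // Byte 0 of the result holds the number of padding bits (0..7) in the last byte.
    static byte[] pack(String bits) {
        int padding = (8 - bits.length() % 8) % 8;
        byte[] out = new byte[1 + (bits.length() + padding) / 8];
        out[0] = (byte) padding;
        for (int i = 0; i < bits.length(); i++) {
            if (bits.charAt(i) == '1') {
                out[1 + i / 8] |= 0x80 >>> (i % 8);
            }
        }
        return out;
    }

    // Reverses pack(): returns the original '0'/'1' string without the padding.
    static String unpack(byte[] packed) {
        int totalBits = (packed.length - 1) * 8 - packed[0];
        StringBuilder sb = new StringBuilder(totalBits);
        for (int i = 0; i < totalBits; i++) {
            int b = packed[1 + i / 8] & 0xFF;
            sb.append(((b >> (7 - i % 8)) & 1) == 1 ? '1' : '0');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String bits = "1110001101"; // hypothetical concatenation of Huffman codes
        byte[] packed = pack(bits);
        System.out.println(packed.length);               // 3: one header byte + two data bytes
        System.out.println(unpack(packed).equals(bits)); // true
    }
}
```

Building one big string is fine for small inputs. For large files it is probably better to pack bits as you go instead of holding the whole bit string in memory.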

Answer 1 (score: 1)

I'd suggest adding something like this:

import java.io.IOException;
import java.io.OutputStream;
import java.util.ArrayList;

class BitstreamPacker {
  private int bitPos;  // Actual values 0..7; where to add the next bit.
  private ArrayList<Byte> data = new ArrayList<>();

  public void addBit(boolean bit) {
    // Add the bit to the last byte of data; allocate more if it does not fit.
    // Adjusts bitPos as it goes.
  }

  public void writeBytes(OutputStream output) throws IOException {
    // Writes the number of bytes, then the last bit pos, then the bytes.
  }
}

Similarly,

import java.io.IOException;
import java.io.InputStream;

class BitstreamUnpacker {
  private byte[] data; // Or ArrayList if you wish.
  private int currentBytePos;
  private int currentBitPos;  // Could be enough to track the global bit position.

  public static BitstreamUnpacker fromByteStream(InputStream input) throws IOException {
    // A factory method; reads the stream and creates an instance.
    // Uses the byte count to allocate the right amount of bytes;
    // uses the bit count to limit the last byte to the actual number of bits.
    return ...;
  }

  public Boolean getNextBit() {
    // Reads bits sequentially from the internal data.
    // Returns null when the end of data is reached.
    // Or feel free to implement an iterator / iterable.
  }
}

Note that the bitstream may end in the middle of a byte, so you need to store the number of bits used in the last byte.
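One way the two skeletons could be filled in is sketched below. It is only an illustration under stated assumptions, not the answerer's implementation: the header layout (a 4-byte byte count followed by one byte for the bits used in the last byte, where 0 means all 8), the least-significant-bit-first order, the wrapper class `BitstreamDemo` and the helper `roundTrip` are all choices made for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class BitstreamDemo {

    static class BitstreamPacker {
        private int bitPos;  // 0..7; where the next bit goes in the current byte.
        private int current; // The partially filled byte.
        private final ByteArrayOutputStream full = new ByteArrayOutputStream();

        public void addBit(boolean bit) {
            if (bit) {
                current |= 1 << bitPos;
            }
            if (++bitPos == 8) { // Byte is full: store it and start a new one.
                full.write(current);
                current = 0;
                bitPos = 0;
            }
        }

        // Assumed layout: 4-byte byte count, 1 byte for the number of bits used
        // in the last byte (0 means all 8), then the data bytes.
        public void writeBytes(OutputStream output) throws IOException {
            DataOutputStream out = new DataOutputStream(output);
            out.writeInt(full.size() + (bitPos > 0 ? 1 : 0));
            out.writeByte(bitPos);
            full.writeTo(out);
            if (bitPos > 0) {
                out.writeByte(current);
            }
            out.flush();
        }
    }

    static class BitstreamUnpacker {
        private final byte[] data;
        private final long totalBits; // Number of meaningful bits in data.
        private long pos;             // Global bit position.

        private BitstreamUnpacker(byte[] data, long totalBits) {
            this.data = data;
            this.totalBits = totalBits;
        }

        public static BitstreamUnpacker fromByteStream(InputStream input) throws IOException {
            DataInputStream in = new DataInputStream(input);
            int byteCount = in.readInt();
            int lastBits = in.readUnsignedByte();
            byte[] data = new byte[byteCount];
            in.readFully(data);
            long totalBits = lastBits == 0 ? 8L * byteCount : 8L * (byteCount - 1) + lastBits;
            return new BitstreamUnpacker(data, totalBits);
        }

        // Returns the next bit, or null when the end of the data is reached.
        public Boolean getNextBit() {
            if (pos >= totalBits) {
                return null;
            }
            int bit = (data[(int) (pos / 8)] >> (int) (pos % 8)) & 1;
            pos++;
            return bit == 1;
        }
    }

    // Packs a '0'/'1' string, writes it to memory, reads it back, and unpacks it.
    static String roundTrip(String bits) throws IOException {
        BitstreamPacker packer = new BitstreamPacker();
        for (char c : bits.toCharArray()) {
            packer.addBit(c == '1');
        }
        ByteArrayOutputStream file = new ByteArrayOutputStream(); // Stand-in for a FileOutputStream.
        packer.writeBytes(file);

        BitstreamUnpacker unpacker =
            BitstreamUnpacker.fromByteStream(new ByteArrayInputStream(file.toByteArray()));
        StringBuilder sb = new StringBuilder();
        for (Boolean bit = unpacker.getNextBit(); bit != null; bit = unpacker.getNextBit()) {
            sb.append(bit ? '1' : '0');
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(roundTrip("110001111")); // 9 bits survive the round trip unchanged
    }
}
```

In the compressor, the loop over the input file would call addBit once for every '0' or '1' character of each Huffman code, and writeBytes once at the end.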

To help you understand the idea better, here is some Python code (since Python is easy to play with interactively):

class BitstreamPacker(object):

    def __init__(self):
        self.data = []  # A list of bytes.
        self.bit_offset = 0  # 0..7.

    def add_bit(self, bit):
        if self.bit_offset == 0:  # We must begin a new byte.
            self.data.append(0)  # Append a new byte.
        # We use addition because we know that the bit we're affecting is 0.
        # [-1] means last element.
        self.data[-1] += (bit << self.bit_offset)
        self.bit_offset += 1
        if self.bit_offset > 7:  # We've exceeded one byte.
            self.bit_offset = 0  # Shift the offset to the beginning of a byte.

    def get_bytes(self):
        # Just returning the data instead of writing, to simplify interactive use.
        return (len(self.data), self.bit_offset, self.data)

Here is how to use it from the Python REPL:

>>> bp = BitstreamPacker()
>>> bp.add_bit(1)
>>> bp.add_bit(1)
>>> bp.get_bytes()
(1, 2, [3]) # One byte, two bits in it are used.
>>> bp.add_bit(0)
>>> bp.add_bit(0)
>>> bp.add_bit(0)
>>> bp.add_bit(1)
>>> bp.add_bit(1)
>>> bp.add_bit(1)
>>> bp.get_bytes()
(1, 0, [227])  # Whole 8 bits of one byte were used.
>>> bp.add_bit(1)
>>> bp.get_bytes()
(2, 1, [227, 1])  # Two bytes used: one full, and one bit in the next.
>>> assert 0b11100011 == 227  # The binary we sent matches.
>>> _

I hope this helps.