压缩小文件的算法(345字节)

时间:2016-01-14 18:06:20

标签: algorithm compression

我正在寻找能够无损压缩少量噪音数据的软件;例如:

ZmNiYWNma3F5gYqSmqCkpqenpKGdmJONiIN/e3d1c3FxcXFxcnN0dXZ3eXp7fHx9fn5/gIGDhISFhYWFhIOCgYGAgICBgoOEhYeIiYmJiYiGhIF+fHl3dHNyc3N1d3p9gIOGiYuMjY2NjIuKiIeFg4GAfnx7eXl4eHl5enx9fn5/gICAgYGBgYGBgYGBgoKDhISEhISEhIOCgoGAgH9/fn5+fn5/gICAgICAgICAgICAgICAf39/gICBgoOFhoeIiIiIiIeGhYOCgH99fHt6enp6e3x9foCAgoKDg4ODgoKBgH59fHt7e3t8fX5/gIKDhIWFhoaGhoaFhISDgoKBgQ==

原始数据:345B(100%), gzip:280B(81%), bzip2:289B(84%), lzop:415B(120%),

我还有其他方法吗?

2 个答案:

答案 0 :(得分:6)

由于数据是Base64编码的(每3个字节变成4个字节),第一步就是解码它(压缩数据无论如何都是二进制的):

344 bytes -> 256 bytes

然后,使用标准winzip的简单测试显示压缩到170 bytesCOMP_DEFLATE块)。你应该和gzip / zlib一样。

压缩系数越高,这可能会有所变小。

压缩原始数据会产生243个字节(.zip文件中的数据块,完整的zip文件为359个字节,但您不需要所有额外的数据)。

因此,在解码数据上使用zlib时,应将其压缩为±170字节

查看解码数据,可以实现更好的压缩。但这取决于具有相同结构的其他数据。

已解码数据的十六进制转储(很多值正在重复,或只是略有变化):

66 63 62 61 63 66 6B 71 79 81 8A 92 9A A0 A4 A6
A7 A7 A4 A1 9D 98 93 8D 88 83 7F 7B 77 75 73 71
71 71 71 71 72 73 74 75 76 77 79 7A 7B 7C 7C 7D
7E 7E 7F 80 81 83 84 84 85 85 85 85 84 83 82 81
81 80 80 80 81 82 83 84 85 87 88 89 89 89 89 88
86 84 81 7E 7C 79 77 74 73 72 73 73 75 77 7A 7D
80 83 86 89 8B 8C 8D 8D 8D 8C 8B 8A 88 87 85 83
81 80 7E 7C 7B 79 79 78 78 79 79 7A 7C 7D 7E 7E
7F 80 80 80 81 81 81 81 81 81 81 81 81 82 82 83
84 84 84 84 84 84 84 83 82 82 81 80 80 7F 7F 7E
7E 7E 7E 7E 7F 80 80 80 80 80 80 80 80 80 80 80
80 80 80 80 7F 7F 7F 80 80 81 82 83 85 86 87 88
88 88 88 88 87 86 85 83 82 80 7F 7D 7C 7B 7A 7A
7A 7A 7B 7C 7D 7E 80 80 82 82 83 83 83 83 82 82
81 80 7E 7D 7C 7B 7B 7B 7B 7C 7D 7E 7F 80 82 83
84 85 85 86 86 86 86 86 85 84 84 83 82 82 81 81

只有在快速查看之后:平均每字节应该可以达到大约2到3位,从而导致 64 96字节

仔细查看数据:

大多数价值观都没有那么大的变化。如果所有数据都与此类似,则可以使用某些自定义代码实现高压缩率。例如,差异可以存储在1,2,3或4位,具体取决于数据块(仅第一个数据点需要4位)。另一种方法是使用现有算法(zlibHuffman coding和其他方法)压缩差异(增量值),而不是完全自定义代码。

包含2轮delta编码的十进制值

102        
99    -3    
98    -1    2
97    -1    0
99     2    3
102    3    1
107    5    2
113    6    1
121    8    2
129    8    0
138    9    1
146    8   -1
154    8    0
160    6   -2
164    4   -2
166    2   -2
167    1   -1
167    0   -1
164   -3   -3
161   -3    0
157   -4   -1
152   -5   -1
147   -5    0
141   -6   -1
136   -5    1
131   -5    0
127   -4    1
123   -4    0
119   -4    0
117   -2    2
115   -2    0
113   -2    0
113    0    2
113    0    0
113    0    0
113    0    0
114    1    1
115    1    0
116    1    0
117    1    0
118    1    0
119    1    0
121    2    1
122    1   -1
123    1    0
124    1    0
124    0   -1
125    1    1
126    1    0
126    0   -1
127    1    1
128    1    0
129    1    0
131    2    1
132    1   -1
132    0   -1
133    1    1
133    0   -1
133    0    0
133    0    0
132   -1   -1
131   -1    0
130   -1    0
129   -1    0
129    0    1
128   -1   -1
128    0    1
128    0    0
129    1    1
130    1    0
131    1    0
132    1    0
133    1    0
135    2    1
136    1   -1
137    1    0
137    0   -1
137    0    0
137    0    0
136   -1   -1
134   -2   -1
132   -2    0
129   -3   -1
126   -3    0
124   -2    1
121   -3   -1
119   -2    1
116   -3   -1
115   -1    2
114   -1    0
115    1    2
115    0   -1
117    2    2
119    2    0
122    3    1
125    3    0
128    3    0
131    3    0
134    3    0
137    3    0
139    2   -1
140    1   -1
141    1    0
141    0   -1
141    0    0
140   -1   -1
139   -1    0
138   -1    0
136   -2   -1
135   -1    1
133   -2   -1
131   -2    0
129   -2    0
128   -1    1
126   -2   -1
124   -2    0
123   -1    1
121   -2   -1
121    0    2
120   -1   -1
120    0    1
121    1    1
121    0   -1
122    1    1
124    2    1
125    1   -1
126    1    0
126    0   -1
127    1    1
128    1    0
128    0   -1
128    0    0
129    1    1
129    0   -1
129    0    0
129    0    0
129    0    0
129    0    0
129    0    0
129    0    0
129    0    0
130    1    1
130    0   -1
131    1    1
132    1    0
132    0   -1
132    0    0
132    0    0
132    0    0
132    0    0
132    0    0
131   -1   -1
130   -1    0
130    0    1
129   -1   -1
128   -1    0
128    0    1
127   -1   -1
127    0    1
126   -1   -1
126    0    1
126    0    0
126    0    0
126    0    0
127    1    1
128    1    0
128    0   -1
128    0    0
128    0    0
128    0    0
128    0    0
128    0    0
128    0    0
128    0    0
128    0    0
128    0    0
128    0    0
128    0    0
128    0    0
128    0    0
127   -1   -1
127    0    1
127    0    0
128    1    1
128    0   -1
129    1    1
130    1    0
131    1    0
133    2    1
134    1   -1
135    1    0
136    1    0
136    0   -1
136    0    0
136    0    0
136    0    0
135   -1   -1
134   -1    0
133   -1    0
131   -2   -1
130   -1    1
128   -2   -1
127   -1    1
125   -2   -1
124   -1    1
123   -1    0
122   -1    0
122    0    1
122    0    0
122    0    0
123    1    1
124    1    0
125    1    0
126    1    0
128    2    1
128    0   -2
130    2    2
130    0   -2
131    1    1
131    0   -1
131    0    0
131    0    0
130   -1   -1
130    0    1
129   -1   -1
128   -1    0
126   -2   -1
125   -1    1
124   -1    0
123   -1    0
123    0    1
123    0    0
123    0    0
124    1    1
125    1    0
126    1    0
127    1    0
128    1    0
130    2    1
131    1   -1
132    1    0
133    1    0
133    0   -1
134    1    1
134    0   -1
134    0    0
134    0    0
134    0    0
133   -1   -1
132   -1    0
132    0    1
131   -1   -1
130   -1    0
130    0    1
129   -1   -1
129    0    1

答案 1 :(得分:3)

以下是一组完整的类,可用于使用以下方法压缩此数据:

  1. First Base64解码字符串以获取字节(345 - > 256字节)
  2. 然后对字节进行delta编码(意味着,从前一个字节中减去一个字节,存储结果)
  3. 另一轮delta编码(但3比2差)
  4. 然后使用Huffman压缩来压缩增量
  5. 这种方法降低到76个字节,包括以后解压缩所需的开销。

    可以找到包含代码的完整Mercurial存储库here

    注意!代码可能包含边缘情况下的错误,例如空或接近空的输入。可以在上面链接的存储库中找到一套单元测试,但可能需要进行更多测试才能将此代码信任用于生产用途。

    测试程序:

    static void Main(string[] args)
    {
        string inputString = "ZmNiYWNma3F5gYqSmqCkpqenpKGdmJONiIN/e3d1c3FxcXFxcnN0dXZ3eXp7fHx9fn5/gIGDhISFhYWFhIOCgYGAgICBgoOEhYeIiYmJiYiGhIF+fHl3dHNyc3N1d3p9gIOGiYuMjY2NjIuKiIeFg4GAfnx7eXl4eHl5enx9fn5/gICAgYGBgYGBgYGBgoKDhISEhISEhIOCgoGAgH9/fn5+fn5/gICAgICAgICAgICAgICAf39/gICBgoOFhoeIiIiIiIeGhYOCgH99fHt6enp6e3x9foCAgoKDg4ODgoKBgH59fHt7e3t8fX5/gIKDhIWFhoaGhoaFhISDgoKBgQ==";
        byte[] original = Convert.FromBase64String(inputString);
    
        Console.WriteLine($"Original String: {inputString.Length}");
        Console.WriteLine($"Original bytes: {original.Length}");
    
        byte[] deltaEncoded = DeltaEncoderDecoder.Encode(original, 2);
        byte[] compressed = HuffmanCompression.Compress(deltaEncoded);
        Console.WriteLine($"Compressed bytes: {compressed.Length}");
    
        byte[] deltaDecoded = HuffmanCompression.Decompress(compressed);
        byte[] decompressed = DeltaEncoderDecoder.Decode(deltaDecoded, 2);
        Console.WriteLine($"Decompressed bytes: {decompressed.Length}");
    
        Console.WriteLine($"Decompressed == original: {decompressed.Length == original.Length && Enumerable.Range(0, original.Length).All(index => original[index] == decompressed[index])}");
    }
    

    输出:

    Original String: 344
    Original bytes: 256
    Compressed bytes: 76
    Decompressed bytes: 256
    Decompressed == original: True
    

    以下是压缩和解压缩数据所需的类:

    public static class HuffmanCompression
    {
        internal const byte CompressedSignature = 255;
        internal const byte UncompressedSignature = 0;
    
        [NotNull]
        public static byte[] Compress([NotNull] byte[] input)
        {
            if (input == null)
                throw new ArgumentNullException(nameof(input));
            if (input.Length == 0)
                return input;
    
            var rootNode = GetNodesFromRawInput(input);
            var bitStrings = GetBitStringsFromTree(rootNode);
    
            var output = new MemoryStream(input.Length);
            var writer = new BitStreamWriter(output);
    
            writer.Write(CompressedSignature);
            writer.Write(input.Length);
            WriteNodes(writer, rootNode);
            WriteStrings(writer, bitStrings, input);
            writer.Flush();
    
            if (output.Length < input.Length + 1)
                return output.ToArray();
    
            return EncodeAsUncompressed(input);
        }
    
        [NotNull]
        private static byte[] EncodeAsUncompressed([NotNull] byte[] input)
        {
            var output = new MemoryStream();
            output.WriteByte(UncompressedSignature);
            output.Write(input, 0, input.Length);
            return output.ToArray();
        }
    
        private static void WriteStrings([NotNull] BitStreamWriter writer, [NotNull] string[] bitStrings, [NotNull] byte[] input)
        {
            foreach (byte value in input)
            {
                Assume(bitStrings[value] != null);
                foreach (char bitChar in bitStrings[value])
                    writer.Write(bitChar == '1');
            }
        }
    
        private static void WriteNodes([NotNull] BitStreamWriter writer, [NotNull] Node node)
        {
            if (node.Left == null)
            {
                writer.Write(false);
                writer.Write(node.Value);
            }
            else
            {
                Assume(node.Right != null);
    
                writer.Write(true);
                WriteNodes(writer, node.Left);
                WriteNodes(writer, node.Right);
            }
        }
    
        [NotNull, ItemNotNull]
        private static string[] GetBitStringsFromTree([NotNull] Node node)
        {
            var result = new string[256];
            TraverseToGetBitStringsFromTree(node, string.Empty, result);
            return result;
        }
    
        private static void TraverseToGetBitStringsFromTree([NotNull] Node node, [NotNull] string prefix, [NotNull, ItemNotNull] string[] dictionary)
        {
            if (node.Left != null)
            {
                Assume(node.Right != null);
    
                TraverseToGetBitStringsFromTree(node.Left, prefix + "0", dictionary);
                TraverseToGetBitStringsFromTree(node.Right, prefix + "1", dictionary);
            }
            else
                dictionary[node.Value] = prefix;
        }
    
        [NotNull]
        private static Node GetNodesFromRawInput([NotNull] byte[] input)
        {
            var occurances = new int[256];
            foreach (byte value in input)
                occurances[value]++;
    
            var nodes = new List<Node>(256);
            for (int index = 0; index < 256; index++)
                if (occurances[index] > 0)
                    nodes.Add(new Node
                    {
                        Occurances = occurances[index],
                        Value = (byte)index
                    });
    
            while (nodes.Count > 1)
            {
                nodes.Sort((n1, n2) =>
                {
                    Assume(n1 != null && n2 != null);
    
                    return n1.Occurances.CompareTo(n2.Occurances);
                });
    
                Assume(nodes[0] != null && nodes[1] != null);
    
                nodes[0] = new Node
                {
                    Left = nodes[0],
                    Right = nodes[1],
                    Occurances = nodes[0].Occurances + nodes[1].Occurances
                };
                nodes.RemoveAt(1);
            }
    
            Assume(nodes[0] != null);
    
            return nodes[0];
        }
    
        [NotNull]
        public static byte[] Decompress([NotNull] byte[] input)
        {
            if (input == null)
                throw new ArgumentNullException(nameof(input));
            if (input.Length == 0)
                return input;
    
            if (input[0] != CompressedSignature)
                return DecodeUncompressed(input);
    
            var reader = new BitStreamReader(new MemoryStream(input));
            reader.ReadByte(); // skip signature
    
            int length = reader.ReadInt32();
            var rootNode = ReadNodes(reader);
            var output = new byte[length];
            for (int index = 0; index < length; index++)
                output[index] = DecompressOneByte(reader, rootNode);
    
            return output;
        }
    
        private static byte DecompressOneByte([NotNull] BitStreamReader reader, [NotNull] Node node)
        {
            while (node.Left != null)
            {
                if (reader.ReadBit())
                    node = node.Right;
                else
                    node = node.Left;
    
                Assume(node != null);
            }
    
            return node.Value;
        }
    
        [NotNull]
        private static Node ReadNodes([NotNull] BitStreamReader reader)
        {
            if (reader.ReadBit())
                return new Node
                {
                    Left = ReadNodes(reader),
                    Right = ReadNodes(reader)
                };
    
            return new Node
            {
                Value = reader.ReadByte()
            };
        }
    
        [NotNull]
        private static byte[] DecodeUncompressed([NotNull] byte[] input)
        {
            return input.Skip(1).ToArray();
        }
    }
    
    public class BitStreamReader
    {
        [NotNull]
        private readonly MemoryStream _Source;
    
        private byte _Buffer;
        private int _InBuffer;
    
        public BitStreamReader([NotNull] MemoryStream source)
        {
            if (source == null)
                throw new ArgumentNullException(nameof(source));
    
            _Source = source;
        }
    
        public bool ReadBit()
        {
            if (_InBuffer == 0)
                FillBuffer();
    
            return (_Buffer & BitStreamConstants.BitMasks[8 - _InBuffer--]) != 0;
        }
    
        public byte ReadByte()
        {
            if (_InBuffer == 8)
            {
                _InBuffer = 0;
                return _Buffer;
            }
    
            return (byte)((ReadBit() ? 128 : 0) | (ReadBit() ? 64 : 0) | (ReadBit() ? 32 : 0) | (ReadBit() ? 16 : 0) | (ReadBit() ? 8 : 0) | (ReadBit() ? 4 : 0) | (ReadBit() ? 2 : 0) | (ReadBit() ? 1 : 0));
        }
    
        public int ReadInt32()
        {
            int result = 0;
            if (ReadBit())
                result |= ReadByte();
            if (ReadBit())
                result |= ReadByte() << 8;
            if (ReadBit())
                result |= ReadByte() << 16;
            if (ReadBit())
                result |= ReadByte() << 24;
            return result;
        }
    
        private void FillBuffer()
        {
            int value = _Source.ReadByte();
            if (value < 0)
                throw new InvalidOperationException("Read past end of source stream");
    
            _Buffer = (byte)value;
            _InBuffer = 8;
        }
    }
    
    public class BitStreamWriter
    {
        [NotNull]
        private readonly MemoryStream _Target;
    
        private byte _Buffer;
        private int _InBuffer;
    
        public BitStreamWriter([NotNull] MemoryStream target)
        {
            if (target == null)
                throw new ArgumentNullException(nameof(target));
    
            _Target = target;
        }
    
        public void Flush()
        {
            if (_InBuffer == 0)
                return;
    
            _Target.WriteByte(_Buffer);
            _Buffer = 0;
            _InBuffer = 0;
        }
    
        public void Write(bool bit)
        {
            unchecked
            {
                if (bit)
                    _Buffer = (byte)(_Buffer | 1 << (7 - _InBuffer));
                if (++_InBuffer == 8)
                    Flush();
            }
        }
    
        public void Write(byte value)
        {
            for (int index = 0; index < 8; index++)
                Write((value & BitStreamConstants.BitMasks[index]) != 0);
        }
    
        public void Write(int value)
        {
            byte b0 = (byte)(value & 0xff);
            byte b1 = (byte)((value >> 8) & 0xff);
            byte b2 = (byte)((value >> 16) & 0xff);
            byte b3 = (byte)((value >> 24) & 0xff);
    
            Write(b0 != 0);
            if (b0 != 0)
                Write(b0);
    
            Write(b1 != 0);
            if (b1 != 0)
                Write(b1);
    
            Write(b2 != 0);
            if (b2 != 0)
                Write(b2);
    
            Write(b3 != 0);
            if (b3 != 0)
                Write(b3);
        }
    }
    
    internal static class BitStreamConstants
    {
        [NotNull]
        public static readonly byte[] BitMasks = { 128, 64, 32, 16, 8, 4, 2, 1 };
    
        public const byte CompressedSignature = 255;
    }
    
    public static class DeltaEncoderDecoder
    {
        [NotNull]
        public static byte[] Encode([NotNull] byte[] input, int iterations)
        {
            if (input == null)
                throw new ArgumentNullException(nameof(input));
    
            var output = new byte[input.Length];
            Buffer.BlockCopy(input, 0, output, 0, input.Length);
    
            while (iterations-- > 0)
            {
                byte previous = 0;
                for (int index = 0; index < output.Length; index++)
                {
                    byte current = output[index];
                    output[index] = (byte)(current - previous);
                    previous = current;
                }
            }
    
            return output;
        }
    
        [NotNull]
        public static byte[] Decode([NotNull] byte[] input, int iterations)
        {
            if (input == null)
                throw new ArgumentNullException(nameof(input));
    
            var output = new byte[input.Length];
            Buffer.BlockCopy(input, 0, output, 0, input.Length);
    
            while (iterations-- > 0)
            {
                byte previous = 0;
                for (int index = 0; index < output.Length; index++)
                {
                    output[index] = (byte)(previous + output[index]);
                    previous = output[index];
                }
            }
    
            return output;
        }
    }