Question

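I need to serialize the following data into the smallest file size possible.

I have a collection of patterns; each pattern is a byte array (byte[]) of a set length.

In this example, let's use a pattern length of 5, so the byte array will be: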

var pattern = new byte[] {1, 2, 3, 4, 5};

Assume we have 3 copies of the same pattern in a collection:

var collection = new byte[][] { pattern, pattern, pattern };

Currently I am saving the collection in an ASCII-encoded file. Using the collection above, the saved file would look like this:

010203040501020304050102030405

Each byte in the array is represented by 2 digits (00) so that I can cater for byte values between 0 and 25. It can be visualized like this:

[01 | 02 | 03 | 04 | 05] [01 | 02 | 03 | 04 | 05] [01 | 02 | 03 | 04 | 05]

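When I deserialize the file, I parse each block of 2 characters as a byte and put every 5 bytes into a byte array.

As I understand it, each character in an ASCII-encoded file is one byte, giving 256 possible values, but all I need is for each block of 2 characters to hold a decimal value from 0 to 25.

When I save a file with 50,000 patterns, each with a length of 12, I end up with a 1.7MB file, which is far too big.

What encoding can I use in C# to make the file much smaller?

Please provide example code showing how to write this data to a file and read it back.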

Answer 1

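I did something similar when encoding binary data into barcodes (see Efficient compression and representation of key value pairs to be read from 1D barcodes). Consider the following code, which serializes the samples to a file and immediately deserializes them: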

using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Numerics;

static void Main(string[] args)
{
    var data = new List<byte[]>() {
        new byte[] { 01, 05, 15, 04, 11, 00, 01, 01, 05, 15, 04, 11, 00, 01 },
        new byte[] { 09, 04, 02, 00, 08, 12, 01, 07, 04, 02, 00, 08, 12, 01 },
        new byte[] { 01, 05, 06, 04, 02, 00, 01, 01, 05, 06, 04, 02, 00, 01 }
    };

    // has to be known when loading the file
    var reasonableBase = data.SelectMany(i => i).Max() + 1;

    // File.Create truncates an existing file; File.OpenWrite would leave stale trailing bytes
    using (var target = File.Create("data.bin"))
    {
        using (var writer = new BinaryWriter(target))
        {
            // write the number of lines (16 bit, so at most 65535 lines)
            writer.Write((ushort)data.Count);

            // write the base (8 bit, base limited to 255)
            writer.Write((byte)reasonableBase);

            foreach (var sample in data)
            {
                // converts the byte array into a large number of the known base (bypasses all the bit-mess)
                var serializedData = ByteArrayToNumberBased(sample, reasonableBase).ToByteArray();

                // write the length of the sample (8 bit, limited to 255)
                writer.Write((byte)serializedData.Length);
                writer.Write(serializedData);
            }
        }
    }

    var deserializedData = new List<byte[]>();

    using (var source = File.OpenRead("data.bin"))
    {
        using (var reader = new BinaryReader(source))
        {
            var lines = reader.ReadUInt16();
            var sourceBase = reader.ReadByte();

            for (int i = 0; i < lines; i++)
            {
                var length = reader.ReadByte();
                var value = new BigInteger(reader.ReadBytes(length));

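                // split the big number we loaded back into its digits
                // works because we know the base; the fixed pattern length must be known
                // too (taken from the sample data here), because zero digits at the end of
                // a pattern do not change the number and would otherwise be lost
                deserializedData.Add(NumberToByteArrayBased(value, sourceBase, data[0].Length));
            }
        }
    }
}

private static BigInteger ByteArrayToNumberBased(byte[] data, int numBase)
{
    var result = BigInteger.Zero;

    // data[i] is digit i (least significant first) of a number in base numBase
    for (int i = 0; i < data.Length; i++)
    {
        result += data[i] * BigInteger.Pow(numBase, i);
    }

    return result;
}

private static byte[] NumberToByteArrayBased(BigInteger data, int numBase, int length)
{
    var list = new List<byte>();

    // peel off the digits, least significant first
    do
    {
        list.Add((byte)(data % numBase));
    }
    while ((data = (data / numBase)) > 0);

    // restore zero digits that were dropped from the end of the pattern
    while (list.Count < length)
    {
        list.Add(0);
    }

    return list.ToArray();
}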

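Compared to your format, the sample data serializes to 27 bytes instead of 90. Using @xanatos's 4.7 bits per symbol, a perfect result would be 14 * 3 * 4.7 / 8 = 24.675 bytes, so that's not bad (to be fair: the example serializes to 30 bytes with the base set to 26).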
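For anyone wondering where the 4.7 bits per symbol comes from: it is log2(26), the information content of one value that can take 26 equally likely states. The sketch below just reproduces that arithmetic. The class and method names (SizeEstimate, BitsPerSymbol, MinimumBytes) are made up for illustration, and the figures are lower bounds that ignore any header bytes.

```csharp
using System;

static class SizeEstimate
{
    // Theoretical minimum number of bits per value when each value
    // can take symbolCount equally likely states.
    public static double BitsPerSymbol(int symbolCount) => Math.Log(symbolCount, 2);

    // Lower bound in bytes for valueCount values (headers not included).
    public static double MinimumBytes(long valueCount, int symbolCount) =>
        valueCount * BitsPerSymbol(symbolCount) / 8.0;

    static void Main()
    {
        // Values 0..25 give 26 symbols: about 4.70 bits each.
        Console.WriteLine(BitsPerSymbol(26));

        // The three 14-value samples from the answer: about 24.7 bytes.
        Console.WriteLine(MinimumBytes(14 * 3, 26));

        // The file from the question: 50,000 patterns of length 12.
        Console.WriteLine(MinimumBytes(50000L * 12, 26));
    }
}
```

The same calculation puts the question's 50,000 x 12 file at roughly 350 KB as a best case, compared with 600,000 bytes at one byte per value.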

Answer 2

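Here is an example of how to write data to, and read it back from, a compressed file using GZipStream and BinaryFormatter.

It is not very efficient for small arrays, but becomes more efficient for large ones. Note, however, that this relies on the data being compressible - if it isn't, this won't be of any use!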

using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Runtime.Serialization.Formatters.Binary;

namespace Demo
{
    static class Program
    {
        static void Main()
        {
            var pattern    = new byte[] { 1, 2, 3, 4, 5 };
            var collection = new [] { pattern, pattern, pattern };

            string filename = @"e:\tmp\test.bin";
            zipToFile(filename, collection);

            var deserialised = unzipFromFile(filename);

            Console.WriteLine(string.Join("\n", deserialised.Select(row => string.Join(", ", row))));
        }

        static void zipToFile(string file, byte[][] data)
        {
            using (var output = new FileStream(file, FileMode.Create))
            using (var gzip   = new GZipStream(output, CompressionLevel.Optimal))
            {
                new BinaryFormatter().Serialize(gzip, data);
            }
        }

        static byte[][] unzipFromFile(string file)
        {
            using (var input = new FileStream(file, FileMode.Open))
            using (var gzip  = new GZipStream(input, CompressionMode.Decompress))
            {
                return (byte[][]) new BinaryFormatter().Deserialize(gzip);
            }
        }
    }
}
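One caveat if you are on a recent runtime: BinaryFormatter is marked obsolete from .NET 5 onward and is discouraged for security reasons, and it also adds its own type metadata to the output. Below is a minimal sketch of the same gzip idea using BinaryWriter and BinaryReader instead. The small header (pattern count, then pattern length) and the names GzipPatterns, Write and Read are my own assumptions, not part of the answer above.

```csharp
using System;
using System.IO;
using System.IO.Compression;

static class GzipPatterns
{
    // Layout inside the gzip stream (an assumed format for this sketch):
    // pattern count (int32), pattern length (int32), then the raw pattern bytes.
    public static void Write(Stream output, byte[][] patterns)
    {
        using (var gzip = new GZipStream(output, CompressionLevel.Optimal, leaveOpen: true))
        using (var writer = new BinaryWriter(gzip))
        {
            int length = patterns.Length == 0 ? 0 : patterns[0].Length;
            writer.Write(patterns.Length);
            writer.Write(length);

            foreach (var pattern in patterns)
            {
                if (pattern.Length != length)
                    throw new ArgumentException("All patterns must have the same length.");
                writer.Write(pattern);
            }
        }
    }

    public static byte[][] Read(Stream input)
    {
        using (var gzip = new GZipStream(input, CompressionMode.Decompress, leaveOpen: true))
        using (var reader = new BinaryReader(gzip))
        {
            int count = reader.ReadInt32();
            int length = reader.ReadInt32();

            var patterns = new byte[count][];
            for (int i = 0; i < count; i++)
                patterns[i] = reader.ReadBytes(length);
            return patterns;
        }
    }

    static void Main()
    {
        var pattern = new byte[] { 1, 2, 3, 4, 5 };
        string path = Path.GetTempFileName();

        using (var file = File.Create(path))
            Write(file, new[] { pattern, pattern, pattern });

        using (var file = File.OpenRead(path))
            foreach (var row in Read(file))
                Console.WriteLine(string.Join(", ", row));

        File.Delete(path);
    }
}
```

As with the original version, how much this saves depends entirely on how repetitive the patterns are.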

Answer 3

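Sometimes simplicity is the best compromise.

A rectangular array can be thought of as a sequence of linear arrays.

A byte file is a linear array of bytes.

Here is some very simple code that flattens your rectangular byte array and writes the bytes to a file: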

// All patterns must be the same length so they can be split when reading
File.WriteAllBytes(Path.GetTempFileName(), collection.SelectMany(p => p).ToArray());

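System.Linq.Enumerable.SelectMany(pattern => pattern) takes a sequence of sequences and flattens it into a single sequence. (Combined with ToArray() it is not the most efficient approach, but for 50,000 * 4 elements it is probably fine.)

Given that as a starting point, if compression is needed, Zip would be the way to go, as shown by Matthew Watson.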
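The snippet above only covers writing. Reading the file back just means cutting the flat byte array into chunks of the known pattern length. Here is a rough sketch of both directions; FlatPatterns, Save and Load are names I made up, and it assumes the reader already knows the pattern length, since the file does not store it.

```csharp
using System;
using System.IO;
using System.Linq;

static class FlatPatterns
{
    // Writes all patterns back to back, one byte per value.
    public static void Save(string path, byte[][] patterns) =>
        File.WriteAllBytes(path, patterns.SelectMany(p => p).ToArray());

    // Splits the file into patterns of patternLength bytes each.
    public static byte[][] Load(string path, int patternLength)
    {
        byte[] flat = File.ReadAllBytes(path);
        if (patternLength <= 0 || flat.Length % patternLength != 0)
            throw new InvalidDataException("File size is not a multiple of the pattern length.");

        var patterns = new byte[flat.Length / patternLength][];
        for (int i = 0; i < patterns.Length; i++)
        {
            patterns[i] = new byte[patternLength];
            Buffer.BlockCopy(flat, i * patternLength, patterns[i], 0, patternLength);
        }
        return patterns;
    }

    static void Main()
    {
        var pattern = new byte[] { 1, 2, 3, 4, 5 };
        string path = Path.GetTempFileName();

        Save(path, new[] { pattern, pattern, pattern });
        foreach (var row in Load(path, 5))
            Console.WriteLine(string.Join(", ", row));

        File.Delete(path);
    }
}
```

At one byte per value, 50,000 patterns of length 12 come to 600,000 bytes.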

C# - How do I save byte values to a file with the smallest possible size?
