Question

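I want to deserialize a list of one million (String, Guid) pairs for a performance-critical application. The format can be anything I choose, and serialization does not have the same performance requirements.

Which approach is best? Text or binary? Writing each (string, guid) pair consecutively, or writing all the strings followed by all the guids?

I started playing around in LinqPad (with a simple example that deserializes only strings) and found, slightly counterintuitively, that using a TextReader with ReadLine() is faster than using a BinaryReader with ReadString(). (Could the file system cache be playing tricks on me?)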

public string[] DeSerializeBinary()
{
    var tmr = System.Diagnostics.Stopwatch.StartNew();
    long ms = 0;
    string[] arr = null;
    using (var rdr = new BinaryReader(new FileStream(file, FileMode.Open, FileAccess.Read)))
    {
        var num = rdr.ReadInt32();
        arr = new String[num];
        for (int i = 0; i < num; i++)
        {
            arr[i] = rdr.ReadString();
        }
        tmr.Stop();
        ms = tmr.ElapsedMilliseconds;
        Console.WriteLine("DeSerializeBinary took {0}ms", ms);
    }
    return arr;
}

public string[] DeserializeText()
{
    var tmr = System.Diagnostics.Stopwatch.StartNew();
    long ms = 0;
    string[] arr = null;
    using (var rdr = File.OpenText(file))
    {
        var num = Int32.Parse(rdr.ReadLine());
        arr = new String[num];
        for (int i = 0; i < num; i++)
        {
            arr[i] = rdr.ReadLine();
        }
        tmr.Stop();
        ms = tmr.ElapsedMilliseconds;
        Console.WriteLine("DeserializeText took {0}ms", ms);
    }
    return arr;
}

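Some edits:

I used RamMap to clear the file system cache, and it turned out there was little to choose between the text and binary readers for strings only.
I have a fairly simple class that holds the string and the guid. It also holds an int index that corresponds to its position in the list. Obviously, there is no need to include that in the serialization.
In a test of alternating (binary) deserialization of strings and guids, I get about 500 ms.
The ideal time is 50 ms, or as close to it as I can get. However, a simple experiment showed that reading the (compressed) file into memory from a fairly fast SSD takes at least 120 ms without any parsing at all, so 50 ms seems unlikely.
Our strings have no theoretical length limit. However, we can assume that the performance target only applies when all of them are 20 characters or shorter.
The timings include opening the file.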

Profile of current code experiment

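Reading the strings is now the clear bottleneck (hence my experiments with serializing only strings). JIT_NewFast accounted for 30% before I pre-allocated a single 16-byte array for reading the GUIDs.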

Answer 1

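It is not surprising that reading a lot of strings with a StreamReader is faster than with a BinaryReader. StreamReader reads blocks from the underlying stream and parses strings out of that buffer. BinaryReader has no such buffer. It reads the string's length prefix from the underlying stream and then reads that many bytes. As a result, BinaryReader makes many more calls to the base stream's Read method.

However, deserializing (String, Guid) pairs involves more than just reading. You also have to parse the Guid. If you write the file in binary, the Guid is written in binary form, which makes creating the Guid structure easier and faster. If it is stored as text, you have to split the line into two fields and then call new Guid(string) to parse the text and create the Guid.

It is hard to say which one will be faster.

I can't imagine that we are talking about a whole lot of time here. Surely reading a file of a million lines takes about a second, unless the strings are really long. A GUID is only 36 characters if you count the separators, right?

With a BinaryWriter, you can write the file like this: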

writer.Write(count); // integer number of records
foreach (var pair in pairs)
{
    writer.Write(pair.theString);
    writer.Write(pair.theGuid.ToByteArray());
}

To read it, you have:

int count = reader.ReadInt32();
// One 16-byte buffer, reused for every record to avoid per-GUID allocations
byte[] guidBytes = new byte[16];
for (int i = 0; i < count; ++i)
{
    string s = reader.ReadString();
    reader.Read(guidBytes, 0, guidBytes.Length);
    pairs.Add(new Pair(s, new Guid(guidBytes)));
}

Whether that is faster than splitting a string and calling the Guid constructor that takes a string parameter, I don't know.
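For comparison, here is a rough sketch of what the text-based alternative could look like. The layout is an assumption made only for illustration: a first line holding the count, then one record per line in the form string, tab, GUID, with strings that contain no line breaks. The names TextPairFormat, ParseLine and ReadAll are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class TextPairFormat
{
    // Assumed layout of one record: "<string>\t<guid>".
    // The GUID comes last, so splitting at the last tab keeps any tabs in the string intact.
    public static KeyValuePair<string, Guid> ParseLine(string line)
    {
        int tab = line.LastIndexOf('\t');
        if (tab < 0)
        {
            throw new FormatException("Missing tab separator.");
        }
        string s = line.Substring(0, tab);
        Guid g = new Guid(line.Substring(tab + 1)); // parses the textual GUID
        return new KeyValuePair<string, Guid>(s, g);
    }

    // Reads the count from the first line, then that many records.
    public static List<KeyValuePair<string, Guid>> ReadAll(TextReader reader)
    {
        int count = int.Parse(reader.ReadLine());
        var pairs = new List<KeyValuePair<string, Guid>>(count);
        for (int i = 0; i < count; ++i)
        {
            pairs.Add(ParseLine(reader.ReadLine()));
        }
        return pairs;
    }
}
```

Timing ReadAll against the binary loop on the same data is the only reliable way to tell which one wins for a given set of strings.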

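I suspect that any difference will be very slight. I would probably go with the simplest method: the text file.

If you want to get really crazy, you can write a custom format that can be slurped up in just a few large reads (a header, an index, and two arrays for the strings and the GUIDs), and do all the other work in memory. That would almost certainly be faster. But faster enough to warrant the extra work? Doubtful.

Update

Or perhaps not so doubtful. Here is some code that writes and reads a custom binary format. The format is: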

count (int32)

guids (count * 16 bytes)

strings (one big concatenated string)

index (the index of the first character of each string within the big string)

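I am assuming that you use a Dictionary<string, Guid> to hold these things. But your data structure does not really matter. The code would be essentially the same.

Note that I have tested this only very briefly. I won't claim that the code is 100% bug free, but I think you can get the idea of what I am doing.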

private void WriteGuidFile(string filename, Dictionary<string, Guid> guids)
{
    using (var fs = File.Create(filename))
    {
        using (var writer = new BinaryWriter(fs, Encoding.UTF8))
        {
            List<int> stringIndex = new List<int>(guids.Count);
            StringBuilder bigString = new StringBuilder();

            // write count
            writer.Write(guids.Count);

            // Write the GUIDs and build the string index
            foreach (var pair in guids)
            {
                writer.Write(pair.Value.ToByteArray(), 0, 16);
                stringIndex.Add(bigString.Length);
                bigString.Append(pair.Key);
            }
            // Add one more entry to the string index.
            // makes deserializing easier
            stringIndex.Add(bigString.Length);

            // Write the string that contains all of the strings, combined
            writer.Write(bigString.ToString());

            // write the index
            foreach (var ix in stringIndex)
            {
                writer.Write(ix);
            }
        }
    }
}

Reading is slightly more involved:

private Dictionary<string, Guid> ReadGuidFile(string filename)
{
    using (var fs = File.OpenRead(filename))
    {
        using (var reader = new BinaryReader(fs, Encoding.UTF8))
        {
            // read the count
            int count = reader.ReadInt32();

            // The guids are in a huge byte array sized 16*count
            byte[] guidsBuffer = new byte[16*count];
            reader.Read(guidsBuffer, 0, guidsBuffer.Length);

            // Strings are all concatenated into one
            var bigString = reader.ReadString();

            // Index is an array of int. We can read it as an array of
            // ((count+1) * 4) bytes.
            byte[] indexBuffer = new byte[4*(count+1)];
            reader.Read(indexBuffer, 0, indexBuffer.Length);

            var guids = new Dictionary<string, Guid>(count);
            byte[] guidBytes = new byte[16];
            int startix = 0;
            int endix = 0;
            for (int i = 0; i < count; ++i)
            {
                endix = BitConverter.ToInt32(indexBuffer, 4*(i+1));
                string key = bigString.Substring(startix, endix - startix);
                Buffer.BlockCopy(guidsBuffer, (i*16), guidBytes, 0, 16);
                guids.Add(key, new Guid(guidBytes));
                startix = endix;
            }
            return guids;
        }
    }
}

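A couple of notes here. First, I am using BitConverter to convert the data in the byte array to integers. It would be faster to use unsafe code and index into the array with an int32*.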
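As a rough illustration of that first note, the index can be read through an int pointer instead of BitConverter. This is only a sketch: it needs unsafe code enabled at compile time (AllowUnsafeBlocks), it assumes a little-endian machine so that the bytes written by BinaryWriter match the native int layout, and the names IndexDecoding and SplitKeys are hypothetical.

```csharp
using System;

static class IndexDecoding
{
    // Splits bigString into count keys using the (count + 1) start offsets
    // stored as 32-bit integers in indexBuffer.
    // Assumes a little-endian platform and unsafe code enabled.
    public static unsafe string[] SplitKeys(string bigString, byte[] indexBuffer, int count)
    {
        if (indexBuffer.Length < sizeof(int) * (count + 1))
        {
            throw new ArgumentException("Index buffer is too small.", "indexBuffer");
        }
        var keys = new string[count];
        fixed (byte* p = indexBuffer)
        {
            int* index = (int*)p; // view the bytes as an array of int
            for (int i = 0; i < count; ++i)
            {
                int start = index[i];
                int end = index[i + 1];
                keys[i] = bigString.Substring(start, end - start);
            }
        }
        return keys;
    }
}
```

Whether this is measurably faster than BitConverter.ToInt32 for a million entries is something to profile rather than assume.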

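You might gain some speed by using a pointer to index into guidsBuffer and calling the Guid Constructor (Int32, Int16, Int16, Byte, Byte, Byte, Byte, Byte, Byte, Byte, Byte), rather than using Buffer.BlockCopy to copy the GUID into the temporary array.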
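A minimal sketch of that idea follows. It assumes unsafe code is enabled, a little-endian machine (so the first eight bytes produced by ToByteArray can be read directly as an int and two shorts), and a hypothetical helper named GuidAt.

```csharp
using System;

static class GuidDecoding
{
    // Builds the i-th Guid directly from guidsBuffer without copying
    // its 16 bytes into a temporary array first.
    // Assumes a little-endian platform and unsafe code enabled.
    public static unsafe Guid GuidAt(byte[] guidsBuffer, int i)
    {
        fixed (byte* p = &guidsBuffer[i * 16])
        {
            int a = *(int*)p;            // bytes 0..3
            short b = *(short*)(p + 4);  // bytes 4..5
            short c = *(short*)(p + 6);  // bytes 6..7
            // The remaining eight bytes are passed through one by one.
            return new Guid(a, b, c,
                p[8], p[9], p[10], p[11], p[12], p[13], p[14], p[15]);
        }
    }
}
```

In the read loop this would replace the Buffer.BlockCopy call and the new Guid(guidBytes) call with a single GuidAt(guidsBuffer, i).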

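You could make the string index an index of lengths rather than starting positions. That would eliminate the need for the extra value at the end of the array, but it is unlikely to make any difference in speed.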
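The length-based variant could be sketched as follows, assuming the index has already been decoded into an int array with one length per string. The names LengthIndex and SplitByLengths are hypothetical.

```csharp
using System;

static class LengthIndex
{
    // Splits bigString into keys using one length per key.
    // A running offset replaces the stored start positions,
    // so no extra trailing index entry is needed.
    public static string[] SplitByLengths(string bigString, int[] lengths)
    {
        var keys = new string[lengths.Length];
        int start = 0;
        for (int i = 0; i < lengths.Length; ++i)
        {
            keys[i] = bigString.Substring(start, lengths[i]);
            start += lengths[i];
        }
        return keys;
    }
}
```

On the writing side, this corresponds to storing pair.Key.Length for each entry instead of the current length of the big string.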

There may be other optimization opportunities, but I think you get the general idea here.
