Question

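I did some research and found that the most efficient way for me to read and write multi-gigabyte (5GB+) files was to use something like the following code: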

using (FileStream fs = File.Open(file, FileMode.Open, FileAccess.Read, FileShare.Read))
using (BufferedStream bs = new BufferedStream(fs, 256 * 1024))
using (StreamReader sr = new StreamReader(bs, Encoding.ASCII, false, 256 * 1024))
{
    StreamWriter sw = new StreamWriter(outputFile, true, Encoding.Unicode, 256 * 1024);
    string line = "";

    while (sr.BaseStream != null && (line = sr.ReadLine()) != null)
    {
        //Try to clean csv then split
        line = Regex.Replace(line, "[\\s\\dA-Za-z][\"][\\s\\dA-Za-z]", ""); 
        string[] fields = Regex.Split(line, ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");
        //I know there are libraries for this that I will switch out 
        //when I have time to create the classes as it seems they all
        //require a mapping class

        //Remap 90-250 properties
        object myObj = ObjectMapper(fields);

        //Write line
        bool success = ObjectWriter(myObj);
    }

    sw.Dispose();
}

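CPU usage averages around 33% for each of the 3 instances on an Intel Xeon 2.67 GHz. I was able to output 2 files, each just under 7GB, in ~26 hours while the process was running 3 instances using the following: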

Parallel.Invoke(
    () => new Worker().DoWork(args[0]),
    () => new Worker().DoWork(args[1]),
    () => new Worker().DoWork(args[2])
);

The third instance is generating a MUCH larger file, 34GB+ so far, and is coming up on day 3, at ~67 hours.

From what I've read, I think performance might be improved by lowering the buffer size to a sweet spot.

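My questions are:

Based on what's stated, is this typical performance?
Besides what I mentioned above, are there any other improvements you can see?
Are the CSV mapping and reading libraries much faster than regex?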

Answer 1

So, first of all, you should profile your code to identify the bottlenecks.

Visual Studio comes with a built-in profiler for this purpose, which can clearly identify the hot spots in your code.

Given that your process is CPU-bound, this is likely to be very effective.
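If attaching a profiler to a multi-day run is impractical, a rough first pass can be made with System.Diagnostics.Stopwatch. The following is only a minimal sketch: the StageTimer class and the stage names are hypothetical and are not part of the question's code or of any library.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical helper: accumulates wall-clock time per named stage.
public sealed class StageTimer
{
    private readonly Dictionary<string, Stopwatch> _stages =
        new Dictionary<string, Stopwatch>();

    // Runs the delegate and adds its elapsed time to the named stage.
    public T Measure<T>(string stage, Func<T> work)
    {
        Stopwatch sw;
        if (!_stages.TryGetValue(stage, out sw))
        {
            sw = new Stopwatch();
            _stages[stage] = sw;
        }

        sw.Start();
        try { return work(); }
        finally { sw.Stop(); }
    }

    // Total time recorded for a stage so far (zero if the stage was never measured).
    public TimeSpan Elapsed(string stage)
    {
        Stopwatch sw;
        return _stages.TryGetValue(stage, out sw) ? sw.Elapsed : TimeSpan.Zero;
    }

    // Prints one line per stage with its accumulated time.
    public void Report()
    {
        foreach (KeyValuePair<string, Stopwatch> pair in _stages)
            Console.WriteLine("{0}: {1}", pair.Key, pair.Value.Elapsed);
    }
}
```

Inside the read loop, each step could be wrapped, for example timer.Measure("split", () => Regex.Split(line, pattern)), with timer.Report() called after the loop. The delegate call adds a small per-line overhead, so the totals are only a guide to which stage dominates, not a substitute for a real profiler.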

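However, if I had to guess why it is slow, I would imagine it is because you are not reusing your regexes. A regex is relatively expensive to construct, so reusing it can bring large performance improvements.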

// Build each regex once, outside the loop, and reuse the instances for every line.
var regex1 = new Regex("[\\s\\dA-Za-z][\"][\\s\\dA-Za-z]", RegexOptions.Compiled);
var regex2 = new Regex(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", RegexOptions.Compiled);
while (sr.BaseStream != null && (line = sr.ReadLine()) != null)
{
    //Try to clean csv then split
    line = regex1.Replace(line, ""); 
    string[] fields = regex2.Split(line);
    //I know there are libraries for this that I will switch out 
    //when I have time to create the classes as it seems they all
    //require a mapping class

    //Remap 90-250 properties
    object myObj = ObjectMapper(fields);

    //Write line
    bool success = ObjectWriter(myObj);
}

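However, I would strongly encourage you to use a library such as Linq2Csv. It will likely be more performant, since it will have been through several rounds of performance tuning, and it will handle edge cases that your code doesn't.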
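To illustrate the kind of edge cases meant here, this is a minimal sketch of a single-pass, quote-aware splitter that avoids regex entirely. It is not how Linq2Csv or any other library is implemented. The CsvLineSplitter name is hypothetical, and the sketch assumes that a record never spans multiple lines. Unlike the regex split above, it strips the surrounding quotes and unescapes doubled quotes.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper: splits one CSV record in a single pass.
public static class CsvLineSplitter
{
    // Honours double-quoted fields (which may contain commas) and "" escapes.
    public static List<string> Split(string line)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (inQuotes)
            {
                if (c != '"')
                {
                    current.Append(c);        // ordinary character inside quotes
                }
                else if (i + 1 < line.Length && line[i + 1] == '"')
                {
                    current.Append('"');      // doubled quote is a literal quote
                    i++;
                }
                else
                {
                    inQuotes = false;         // closing quote
                }
            }
            else if (c == '"')
            {
                inQuotes = true;              // opening quote
            }
            else if (c == ',')
            {
                fields.Add(current.ToString()); // field separator
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }

        fields.Add(current.ToString());       // last field (may be empty)
        return fields;
    }
}
```

A single pass like this does a fixed amount of work per character, whereas the lookahead in the splitting regex rescans the rest of the line at every comma, so on very wide rows the difference may be noticeable. Measuring both on a sample of the real data is the only way to be sure.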
