Question

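I am building a tool that analyzes the data quality of files. To do that, I need to read every line of a file and analyze each one. I also need to keep all of the file's lines in memory, because the user can drill down into specific sections. Everything works fine for files with thousands of lines. However, when I try a CSV file with more than 4 million lines, I get an out-of-memory exception. I thought C# could handle a few million items in memory, but apparently not. So I am somewhat stuck and do not know what to do. My code may not be the most efficient, so it would be great if you could tell me how to improve it. Please keep in mind that I need all the lines of the file in memory, because depending on the user's actions I need to access specific lines to display them to the user.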

Here is the call that reads each line:

using (FileStream fs = File.Open(this.dlgInput.FileName.ToString(),   FileMode.Open, FileAccess.Read, FileShare.Read))
using (BufferedStream bs = new BufferedStream(fs))
using (System.IO.StreamReader sr = new  StreamReader(this.dlgInput.FileName.ToString(), Encoding.Default, false, 8192))
{
    string line;
    if (this.chkSkipHeader.Checked)
    {
        sr.ReadLine();
    }

    progressBar1.Visible = true;
    int nbOfLines = File.ReadLines(this.dlgInput.FileName.ToString()).Count();
    progressBar1.Maximum = nbOfLines;

    this.lines = new string[nbOfLines][];
    this.patternedLines = new string[nbOfLines][];
    for (int i = 0; i < nbOfLines; i++)
    {
        this.lines[i] = new string[this.dgvFields.Rows.Count];
        this.patternedLines[i] = new string[this.dgvFields.Rows.Count];
    }

    // Read and display lines from the file until the end of 
    // the file is reached.
    while ((line = sr.ReadLine()) != null)
    {
        this.recordCount += 1;
        char[] c = new char[1] { ',' };
        System.Text.RegularExpressions.Regex CSVParser = new System.Text.RegularExpressions.Regex(",(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))");
        String[] fields = CSVParser.Split(line);
        ParseLine(fields);
        this.lines[recordCount - 1] = fields;
        progressBar1.PerformStep();
    }
}

Here is the ParseLine function, which also keeps data in memory, in arrays, for the analysis it needs to do:

private void ParseLine(String[] fields2)
{
    for (int j = 0; j <= fields2.Length - 1; j++)
    {
        if ((int)this.dgvFields.Rows[j].Cells["colSelected"].Value == 1)
        {
            /*' ************************************************
            ' Save Number of Counts by  Value
            ' ************************************************/

            if (this.values[j].ContainsKey(fields2[j]))
            {
                //values[0] = Dictionary<"TEST", 1> (fields2[0 which is source code] = count])
                this.values[j][fields2[j]] += 1;
            }
            else
            {
                this.values[j].Add(fields2[j], 1);
            }

            /* ' ************************************************
            ' Save Pattern Values/Counts
            ' ************************************************/

            string tmp = System.Text.RegularExpressions.Regex.Replace(fields2[j], "\\p{Lu}", "X");
            tmp = System.Text.RegularExpressions.Regex.Replace(tmp, "\\p{Ll}", "x");
            tmp = System.Text.RegularExpressions.Regex.Replace(tmp, "[0-9]", "0");


            if (this.patterns[j].ContainsKey(tmp))
            {
                this.patterns[j][tmp] += 1;
            }
            else
            {
                this.patterns[j].Add(tmp, 1);
            }

            this.patternedLines[this.recordCount - 1][j] = tmp;
            /* ' ************************************************
             ' Count Blanks/Alpha/Numeric/Phone/Other
             ' ************************************************/


            if (String.IsNullOrWhiteSpace(fields2[j]))
            {
                this.blanks[j] += 1;
            }
            else if (System.Text.RegularExpressions.Regex.IsMatch(fields2[j], "^[0-9]+$"))
            {
                this.numeric[j] += 1;
            }
            else if (System.Text.RegularExpressions.Regex.IsMatch(fields2[j].ToUpper().Replace("EXTENSION", "").Replace("EXT", "").Replace("X", ""), "^[0-9()\\- ]+$"))
            {
                this.phone[j] += 1;
            }
            else if (System.Text.RegularExpressions.Regex.IsMatch(fields2[j], "^[a-zA-Z ]+$"))
            {
                this.alpha[j] += 1;
            }
            else
            {
                this.other[j] += 1;
            }

            if (this.recordCount == 1)
            {
                this.high[j] = fields2[j];
                this.low[j] = fields2[j];
            }
            else
            {
                if (fields2[j].CompareTo(this.high[j]) > 0)
                {
                    this.high[j] = fields2[j];
                }

                if (fields2[j].CompareTo(this.low[j]) < 0)
                {
                    this.low[j] = fields2[j];
                }
            }
        }
    }
}

Update: new code

int nbOfLines = File.ReadLines(this.dlgInput.FileName.ToString()).Count();
        //Read file

        using (System.IO.StreamReader sr = new StreamReader(this.dlgInput.FileName.ToString(), Encoding.Default, false, 8192))
        {
            string line;
            if (this.chkSkipHeader.Checked)
            { sr.ReadLine(); }
            progressBar1.Visible = true;

            progressBar1.Maximum = nbOfLines;
            this.lines = new string[nbOfLines][];
            this.patternedLines = new string[nbOfLines][];
            for (int i = 0; i < nbOfLines; i++)
            {
                this.lines[i] = new string[this.dgvFields.Rows.Count];
                this.patternedLines[i] = new string[this.dgvFields.Rows.Count];
            }

            // Read and display lines from the file until the end of 
            // the file is reached.
            while ((line = sr.ReadLine()) != null)
            {
                this.recordCount += 1;
                char[] c = new char[1] { ',' };
                System.Text.RegularExpressions.Regex CSVParser = new System.Text.RegularExpressions.Regex(",(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))");
                String[] fields = CSVParser.Split(line);
                ParseLine(fields);
                this.lines[recordCount - 1] = fields;
                progressBar1.PerformStep();
            }
        }

Answer 1

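C# limits the size of any single object, which is one way to run into this exception. As far as I know, the default maximum size of a single object in .NET is 2 gigabytes, no matter how much memory the whole system has. Note that the array of row references is not the large part here: 4 million references take only about 32 MB in a 64-bit process. The memory goes to the contents. Every row is its own string[], every field is its own string object with a fixed per-object overhead plus two bytes per character, and the code keeps two such copies (lines and patternedLines), so the total can reach gigabytes even though no single object does. In a 32-bit process that can be enough to exhaust the address space.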

There are several posts on Stack Overflow about creating large arrays: "I need very big array length(size) in C#" and "OutOfMemoryException on declaration of Large Array".

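As I understand it, this is partly due to how the .NET Framework handled the transition from 32-bit to 64-bit. (Note also that 2 gigabytes roughly corresponds to the maximum value of a 32-bit signed integer.) In newer versions of .NET (4.5 and later, from what I have read, though I have never tried it) I think you can raise the maximum object size to some extent; the relevant setting is the gcAllowVeryLargeObjects runtime configuration element, which applies to 64-bit processes. You can also use a special class (for example a custom BigArray class) to work around the size limit.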

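Keep in mind that an array must be allocated as one contiguous range of memory addresses. That is why indexing takes constant time: the address of an item is a fixed offset from the pointer to the first item, so the runtime finds it by multiplying the index by the element size (a constant such as 4 or 8 bytes) and adding the result to the address of the first item. As a consequence, fragmentation of memory reduces the amount of memory that is effectively available to a large array.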
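
To illustrate the "special class" idea, here is a minimal sketch of a chunked array. The name ChunkedArray and the block size are made up for this example and are not part of .NET. It splits the storage into many small arrays, so no single allocation needs one huge contiguous region. It does not reduce the total memory used, so it will not help if the process as a whole runs out of memory.

```csharp
using System;

// Hypothetical helper: an array-like container split into fixed-size blocks,
// so that no single allocation needs one huge contiguous region.
public sealed class ChunkedArray<T>
{
    private const int BlockSize = 1 << 16;   // 65,536 elements per block (arbitrary choice)
    private readonly T[][] blocks;

    public long Length { get; }

    public ChunkedArray(long length)
    {
        if (length < 0) throw new ArgumentOutOfRangeException(nameof(length));
        Length = length;

        long blockCount = (length + BlockSize - 1) / BlockSize;
        blocks = new T[blockCount][];
        for (long b = 0; b < blockCount; b++)
        {
            // The last block only holds the remaining elements.
            long remaining = length - b * BlockSize;
            blocks[b] = new T[Math.Min(BlockSize, remaining)];
        }
    }

    public T this[long index]
    {
        get
        {
            if (index < 0 || index >= Length) throw new IndexOutOfRangeException();
            return blocks[index / BlockSize][index % BlockSize];
        }
        set
        {
            if (index < 0 || index >= Length) throw new IndexOutOfRangeException();
            blocks[index / BlockSize][index % BlockSize] = value;
        }
    }
}
```

Usage is the same as with a normal array, for example var rows = new ChunkedArray<string[]>(nbOfLines); followed by rows[i] = fields;. Indexing costs one extra division and one extra lookup compared with a plain array.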

Answer 2

You need to create a helper class that caches the start position of every line in the file.

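 // Byte offset at which each line starts; long so files over 2 GB work
 long[] cacheLineStartPos;

 // The open file; StreamReader itself has no Position, so seek the stream
 FileStream stream;

 public string GetLine (int lineNumber)
 {
     long linePositionInFile = cacheLineStartPos[lineNumber];

     stream.Position = linePositionInFile;

     // A fresh reader per call, so no buffered data survives the seek
     var reader = new StreamReader(stream, Encoding.Default, false, 4096, leaveOpen: true);

     return reader.ReadLine();
 }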

Of course this is only an example; the real logic may be more complex.
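
The idea above can be fleshed out roughly as follows. This is a sketch under a few assumptions, not a finished implementation: the class name LineIndex is made up, the file has no byte order mark, and the encoding is one where the byte 0x0A only ever means a line feed (UTF-8 or a single-byte code page, not UTF-16). Like ReadLine in the question, it treats a line break inside a quoted CSV field as the end of a record. The index costs 8 bytes per line, so about 32 MB for 4 million lines.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper: remembers the byte offset at which each line starts,
// so a single line can be read back from disk on demand.
public sealed class LineIndex : IDisposable
{
    private readonly FileStream stream;
    private readonly Encoding encoding;
    private readonly List<long> lineStarts = new List<long>();

    public int LineCount => lineStarts.Count;

    public LineIndex(string path, Encoding encoding)
    {
        this.encoding = encoding;
        stream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read);

        // One pass over the raw bytes: a line starts at offset 0 and after every '\n'.
        var buffer = new byte[64 * 1024];
        long position = 0;
        bool atLineStart = true;
        int read;
        while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++)
            {
                if (atLineStart) { lineStarts.Add(position + i); atLineStart = false; }
                if (buffer[i] == (byte)'\n') atLineStart = true;
            }
            position += read;
        }
    }

    public string GetLine(int lineNumber)
    {
        long start = lineStarts[lineNumber];
        long end = lineNumber + 1 < lineStarts.Count ? lineStarts[lineNumber + 1] : stream.Length;

        // Read exactly the bytes of this line, including its line terminator.
        var bytes = new byte[end - start];
        stream.Position = start;
        int total = 0;
        while (total < bytes.Length)
        {
            int n = stream.Read(bytes, total, bytes.Length - total);
            if (n == 0) break;   // file was truncated after indexing
            total += n;
        }
        return encoding.GetString(bytes, 0, total).TrimEnd('\r', '\n');
    }

    public void Dispose() => stream.Dispose();
}
```

The analysis pass can still stream through the file once to fill the counters; only the drill-down view calls GetLine, and only for the rows the user actually opens.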

Answer 3

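If you need to handle large amounts of data, consider using a database. Databases are designed for exactly this purpose, and you can also run specific queries against them. A key-value store may well be enough for you. Have a look at https://ravendb.net/ or https://www.mongodb.com/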

Answer 4

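Even if the user needs to act on all of the data, you should not keep every line in memory. Similar to BareTail, read lines from disk only for the window the user can currently see. When the user moves on to more data, stream another window of the same size from disk, but never load all the lines. Think of a file of around 40 GB: loading all of it is impractical. An example of how to do this is in the linked answer; as other members requested, the code from that answer is reproduced here, credit to @James King: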

//  This really needs to be a member-level variable;
private static readonly object fsLock = new object();

//  Instantiate this in a static constructor or initialize() method
private static FileStream fs = new FileStream("myFile.txt", FileMode.Open, FileAccess.Read);

//  Size of one window in bytes; adjust it to what the UI displays
private const int bufferSize = 4096;


public string ReadFile(long fileOffset) {

    byte[] buffer = new byte[bufferSize];

    int numBytesRead;

    lock (fsLock) {
        fs.Seek(fileOffset, SeekOrigin.Begin);

        //  Read may return fewer bytes than requested, e.g. near the end of the file
        numBytesRead = fs.Read(buffer, 0, bufferSize);
    }

    //  Decode only the bytes that were actually read. With a multi-byte
    //  encoding, a window boundary can fall in the middle of a character.
    return Encoding.Default.GetString(buffer, 0, numBytesRead);

}

Out of memory when reading a large file
