Text file - reading and fixing delimiter problems - too slow

Time: 2014-05-09 15:44:19

Tags: c# asp.net .net file-io

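I am looking for some advice on how to make this function faster.

The function is meant to run through a delimited text file (with CRLF line endings) and remove any stray carriage returns or line feeds that break up a data row.

E.g. the file -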

A|B|C|D
A|B|C|D
A|B|
C|D
A|B|C|D

would become -

A|B|C|D
A|B|C|D
A|B|C|D
A|B|C|D

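The function seems to work well, but now that we have started processing large files the performance is too slow. For example, 800k rows take 3 seconds, while 130 million rows take over an hour...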

The code is -

private void CleanDelimitedFile(string readFilePath, string writeFilePath, string delimiter, string problemFilePath, string rejectsFilePath, int estimateNumberOfRows)
    {
        ArrayList rejects = new ArrayList();
        ArrayList problems = new ArrayList();

        int safeSameLengthBreak = 0;
        int numberOfLinesSameLength = 0;
        int lineCount = 0;
        int maxCount = 0;
        string previousLine = string.Empty;
        string currentLine = string.Empty;

        // determine after how many rows with the same number of delimiter chars we can safely
        // say that we have found the expected length of a row (to save reading the full file twice)
        if (estimateNumberOfRows > 100000000)
            safeSameLengthBreak = estimateNumberOfRows / 200; // set the safe check limit as 0.5% of the file (minimum of 500,000)
        else if (estimateNumberOfRows > 10000000)
            safeSameLengthBreak = estimateNumberOfRows / 50; // set the safe check limit as 2% of the file (minimum of 200,000)
        else
            safeSameLengthBreak = 50000; // set the safe check limit as 50,000 (if there are fewer than 50,000 rows this won't be required anyway)

        // open a reader
        using (var reader = new StreamReader(readFilePath))
        {
            // check the file is still being read
            while (!reader.EndOfStream)
            {
                // append the line count (for debugging)
                lineCount += 1;

                // get the current line
                currentLine = reader.ReadLine();

                // get the number of chars in the new line
                int chars = (currentLine.Length - currentLine.Replace(delimiter, "").Length);

                // if the number is higher than the previous maximum set the new maximum
                if (maxCount < chars)
                {
                    maxCount = chars;

                    // the maximum has changed, reset the number of lines in a row with the same delimiter
                    numberOfLinesSameLength = 0;
                }
                else
                {
                    // the maximum has not changed, add to the number of lines in a row with the same delimiter
                    numberOfLinesSameLength += 1;
                }

                // is the number of lines parsed in a row with the same number of delimiter chars above the safe limit? If so break the loop
                if (numberOfLinesSameLength > safeSameLengthBreak)
                {
                    break;
                }
            }
        }

        // reset the line count
        lineCount = 0;

        // open a writer for the duration of the next read
        using (var writer = new StreamWriter(writeFilePath))
        {
            using (var reader = new StreamReader(readFilePath))
            {
                // check the file is still being read
                while (!reader.EndOfStream)
                {
                    // append the line count (for debugging)
                    lineCount += 1;

                    // get the current line
                    currentLine = reader.ReadLine();

                    // get the number of chars in the new line
                    int chars = (currentLine.Length - currentLine.Replace(delimiter, "").Length);

                    // check the number of chars in the line matches the required number
                    if (chars == maxCount)
                    {
                        // write line
                        writer.WriteLine(currentLine);

                        // clear the previous line variable as this was a valid write
                        previousLine = string.Empty;
                    }
                    else
                    {
                        // add the line to problems
                        problems.Add(currentLine);

                        // append the new line to the previous line
                        previousLine += currentLine;

                        // get the number of chars in the new appended previous line
                        int newPreviousChars = (previousLine.Length - previousLine.Replace(delimiter, "").Length);

                        // check the number of chars in the previous appended line matches the required number
                        if (newPreviousChars == maxCount)
                        {
                            // write line
                            writer.WriteLine(previousLine);

                            // clear the previous line as this was a valid write
                            previousLine = string.Empty;
                        }
                        else if (newPreviousChars > maxCount)
                        {
                            // the number of delimiter chars in the new line is higher than the file maximum, add to rejects
                            rejects.Add(previousLine);

                            // clear the previous line and move on
                            previousLine = string.Empty;
                        }
                    }
                }
            }
        }

        // rename the original file as _original
        System.IO.File.Move(readFilePath, readFilePath.Replace(".txt", "") + "_Original.txt");

        // rename the new file as the original file name
        System.IO.File.Move(writeFilePath, readFilePath);

        // Write rejects
        using (var rejectWriter = new StreamWriter(rejectsFilePath))
        {
            // loop through the rejects list and write each rejected row to the rejects file
            foreach (string reject in rejects)
            {
                rejectWriter.WriteLine(reject);
            }
        }

        // Write problems
        using (var problemWriter = new StreamWriter(problemFilePath))
        {
            // loop through the problems list and write each problem row to the problems file
            foreach (string problem in problems)
            {
                problemWriter.WriteLine(problem);
            }
        }
    }

Any pointers would be greatly appreciated.

Thanks in advance.

1 answer:

Answer 0: (score: 2)

Some ideas

Use List<String> for rejects and problems, and allocate an initial capacity based on how many entries you think they will need.
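As a rough illustration of the pre-sizing idea, here is a minimal sketch. The helper name `CreatePresized` and the `expectedFraction` parameter are made up for this example; how many problem rows a file really contains is something only you can estimate.

```csharp
using System;
using System.Collections.Generic;

public static class LineBuffers
{
    // Hypothetical helper: returns an empty List<string> whose backing array
    // is already sized for the expected number of entries, so the early Add
    // calls do not keep reallocating and copying.
    // Assumes expectedFraction is between 0 and 1.
    public static List<string> CreatePresized(int estimatedRows, double expectedFraction)
    {
        int capacity = (int)Math.Max(16, estimatedRows * expectedFraction);
        return new List<string>(capacity);
    }
}
```

In the question's method this could replace the two `new ArrayList()` calls, for example `var problems = LineBuffers.CreatePresized(estimateNumberOfRows, 0.01);` if you guess that about 1% of rows are broken.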

Do not process the file over the network. Get an SSD, copy the file to it, process it and write the lines there, then copy the file back.

Counting the delimiter this way does not look efficient to me:

int chars = (currentLine.Length - currentLine.Replace(delimiter, "").Length);

This part is wasteful, because it allocates a new string for every line: currentLine.Replace(delimiter, "")

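int chars = 0;
// delimiter is a string in the question; this assumes it holds a single character
char delimiterChar = delimiter[0];
foreach (char c in currentLine) if (c == delimiterChar) chars++;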
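The counting loop above could be wrapped up roughly as follows. This is only a sketch: the class name `DelimiterCounter` and both overloads are invented for illustration. The string overload covers multi-character delimiters, where the question's `Length - Replace(...).Length` expression would return the number of removed characters rather than the number of delimiters.

```csharp
using System;

public static class DelimiterCounter
{
    // Counts a single-character delimiter without allocating a new string.
    public static int Count(string line, char delimiter)
    {
        int count = 0;
        foreach (char c in line)
        {
            if (c == delimiter) count++;
        }
        return count;
    }

    // Counts non-overlapping occurrences of a multi-character delimiter,
    // scanning left to right with an ordinal comparison.
    public static int Count(string line, string delimiter)
    {
        if (string.IsNullOrEmpty(delimiter))
            throw new ArgumentException("Delimiter must not be empty.", "delimiter");

        int count = 0;
        int index = 0;
        while ((index = line.IndexOf(delimiter, index, StringComparison.Ordinal)) >= 0)
        {
            count++;
            index += delimiter.Length; // skip past this match
        }
        return count;
    }
}
```

With this in place, each `int chars = ...` line in the question would become `int chars = DelimiterCounter.Count(currentLine, '|');` for a pipe-delimited file.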

This is not efficient either:

previousLine += currentLine;

Use a StringBuilder instead.
Allocate the StringBuilder once, outside the loop.
Inside the loop, call .Clear() on it.
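To make the StringBuilder suggestion concrete, here is a hedged sketch of the question's second pass. It is not the original code: `LineRepair.Repair` is a hypothetical helper that works on an in-memory sequence and assumes a single-character delimiter. It also keeps a running delimiter count for the pending fragment instead of recounting the joined string, which is an extra step beyond the advice above.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class LineRepair
{
    // Joins broken rows until they contain expectedCount delimiters.
    // Mirrors the question's logic: a complete line discards any pending fragment.
    public static List<string> Repair(IEnumerable<string> lines, char delimiter, int expectedCount)
    {
        var output = new List<string>();
        var pending = new StringBuilder(); // allocated once, outside the loop
        int pendingCount = 0;              // delimiters accumulated in pending

        foreach (string line in lines)
        {
            int count = 0;
            foreach (char c in line)
            {
                if (c == delimiter) count++;
            }

            if (count == expectedCount)
            {
                output.Add(line);
                pending.Clear();
                pendingCount = 0;
                continue;
            }

            pending.Append(line);
            pendingCount += count;

            if (pendingCount == expectedCount)
            {
                output.Add(pending.ToString());
                pending.Clear();
                pendingCount = 0;
            }
            else if (pendingCount > expectedCount)
            {
                // The question adds this row to rejects; the sketch just drops it.
                pending.Clear();
                pendingCount = 0;
            }
        }
        return output;
    }
}
```

For a 130-million-row file you would presumably write each repaired row straight to the StreamWriter rather than collect it in a list; the list is used here only to keep the example self-contained.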