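I'm looking for suggestions on how to make this function faster.
The function is meant to run through a delimited text file (with CRLF line endings) and remove any carriage returns or line feeds that occur in the middle of a data row.
E.g. the file -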
A|B|C|D
A|B|C|D
A|B|
C|D
A|B|C|D
would become -
A|B|C|D
A|B|C|D
A|B|C|D
A|B|C|D
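The function seems to work well, but performance is too slow once we start processing large files. For example, 800k rows take 3 seconds, while 130 million rows take over an hour....
The code is -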
private void CleanDelimitedFile(string readFilePath, string writeFilePath, string delimiter, string problemFilePath, string rejectsFilePath, int estimateNumberOfRows)
{
    ArrayList rejects = new ArrayList();
    ArrayList problems = new ArrayList();
    int safeSameLengthBreak = 0;
    int numberOfLinesSameLength = 0;
    int lineCount = 0;
    int maxCount = 0;
    string previousLine = string.Empty;
    string currentLine = string.Empty;

    // determine after how many rows with the same number of delimiter chars we can safely
    // say that we have found the expected length of a row (to save reading the full file twice)
    if (estimateNumberOfRows > 100000000)
        safeSameLengthBreak = estimateNumberOfRows / 200; // set the safe check limit as 0.5% of the file (minimum of 500,000)
    else if (estimateNumberOfRows > 10000000)
        safeSameLengthBreak = estimateNumberOfRows / 50; // set the safe check limit as 2% of the file (minimum of 200,000)
    else
        safeSameLengthBreak = 50000; // set the safe check limit as 50,000 (if there are fewer than 50,000 rows this won't be required anyway)

    // open a reader
    using (var reader = new StreamReader(readFilePath))
    {
        // check the file is still being read
        while (!reader.EndOfStream)
        {
            // increment the line count (for debugging)
            lineCount += 1;
            // get the current line
            currentLine = reader.ReadLine();
            // get the number of delimiter chars in the new line
            int chars = (currentLine.Length - currentLine.Replace(delimiter, "").Length);
            // if the number is higher than the previous maximum set the new maximum
            if (maxCount < chars)
            {
                maxCount = chars;
                // the maximum has changed, reset the number of lines in a row with the same delimiter count
                numberOfLinesSameLength = 0;
            }
            else
            {
                // the maximum has not changed, add to the number of lines in a row with the same delimiter count
                numberOfLinesSameLength += 1;
            }
            // is the number of lines parsed in a row with the same number of delimiter chars above the safe limit? If so break the loop
            if (numberOfLinesSameLength > safeSameLengthBreak)
            {
                break;
            }
        }
    }

    // reset the line count
    lineCount = 0;

    // open a writer for the duration of the next read
    using (var writer = new StreamWriter(writeFilePath))
    {
        using (var reader = new StreamReader(readFilePath))
        {
            // check the file is still being read
            while (!reader.EndOfStream)
            {
                // increment the line count (for debugging)
                lineCount += 1;
                // get the current line
                currentLine = reader.ReadLine();
                // get the number of delimiter chars in the new line
                int chars = (currentLine.Length - currentLine.Replace(delimiter, "").Length);
                // check the number of chars in the line matches the required number
                if (chars == maxCount)
                {
                    // write line
                    writer.WriteLine(currentLine);
                    // clear the previous line variable as this was a valid write
                    previousLine = string.Empty;
                }
                else
                {
                    // add the line to problems
                    problems.Add(currentLine);
                    // append the new line to the previous line
                    previousLine += currentLine;
                    // get the number of delimiter chars in the new appended previous line
                    int newPreviousChars = (previousLine.Length - previousLine.Replace(delimiter, "").Length);
                    // check the number of chars in the previous appended line matches the required number
                    if (newPreviousChars == maxCount)
                    {
                        // write line
                        writer.WriteLine(previousLine);
                        // clear the previous line as this was a valid write
                        previousLine = string.Empty;
                    }
                    else if (newPreviousChars > maxCount)
                    {
                        // the number of delimiter chars in the appended line is higher than the file maximum, add to rejects
                        rejects.Add(previousLine);
                        // clear the previous line and move on
                        previousLine = string.Empty;
                    }
                }
            }
        }
    }

    // rename the original file as _original
    System.IO.File.Move(readFilePath, readFilePath.Replace(".txt", "") + "_Original.txt");
    // rename the new file as the original file name
    System.IO.File.Move(writeFilePath, readFilePath);

    // Write rejects
    using (var rejectWriter = new StreamWriter(rejectsFilePath))
    {
        // loop through the reject array list and write each reject row to the rejects file
        foreach (string reject in rejects)
        {
            rejectWriter.WriteLine(reject);
        }
    }

    // Write problems
    using (var problemWriter = new StreamWriter(problemFilePath))
    {
        // loop through the problem array list and write each problem row to the problem file
        foreach (string problem in problems)
        {
            problemWriter.WriteLine(problem);
        }
    }
}
Any pointers would be greatly appreciated.
Thanks in advance.
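Answer 0 (score: 2)
Some ideas:
Use List<String> for rejects and problems, and allocate an initial capacity sized to what you think they will need.
Don't process over the network. Get an SSD, copy the file to it, process, write the lines, then copy the file back.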
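To illustrate the List<String> suggestion, here is a minimal sketch. The helper name CreateLineList and the 1% sizing rule are made up for illustration; pick whatever fraction of rows you actually expect to be broken.

```csharp
using System;
using System.Collections.Generic;

public static class CapacityHint
{
    // Hypothetical helper: pre-size a list from the caller's row estimate.
    // The 1% figure is an illustrative guess, not a measured value.
    public static List<string> CreateLineList(int estimatedRows)
    {
        int capacity = Math.Max(16, estimatedRows / 100);
        // The capacity constructor reserves the backing array up front,
        // so the list does not have to regrow repeatedly while it fills.
        return new List<string>(capacity);
    }
}
```

In the question's method this would replace the two ArrayList declarations, e.g. `List<string> rejects = CapacityHint.CreateLineList(estimateNumberOfRows);`. The existing `foreach (string reject in rejects)` loops keep working as they are.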
Counting the delimiters this way doesn't look efficient to me:
int chars = (currentLine.Length - currentLine.Replace(delimiter, "").Length);
This is wasteful: currentLine.Replace(delimiter, "") builds a new string just to measure its length. Count the matches directly instead:
int chars = 0;
foreach (char c in currentLine) if (c == delimiter[0]) chars++; // assumes a single-character delimiter
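Since the question passes the delimiter as a string, the counting loop can also be sketched for delimiters of any length. The class and method names below are hypothetical. Note one difference from the question's code: this returns the number of occurrences, whereas the Length/Replace trick returns occurrences times the delimiter length. For a one-character delimiter such as "|" the two agree.

```csharp
using System;

public static class DelimiterCounter
{
    // Counts non-overlapping occurrences of delimiter in line
    // without creating any intermediate strings.
    public static int Count(string line, string delimiter)
    {
        if (string.IsNullOrEmpty(delimiter)) return 0;

        int count = 0;
        int index = 0;
        // Ordinal comparison skips culture-aware matching.
        while ((index = line.IndexOf(delimiter, index, StringComparison.Ordinal)) >= 0)
        {
            count++;
            index += delimiter.Length; // continue searching after this match
        }
        return count;
    }
}
```

With this in place, both `int chars = ...` lines in the question would become `int chars = DelimiterCounter.Count(currentLine, delimiter);`.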
This is not efficient either:
previousLine += currentLine;
Use a StringBuilder instead.
Allocate the StringBuilder once, outside the loop,
and call .Clear() on it inside the loop.
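A rough sketch of how the StringBuilder version of the joining loop might look. LineJoiner and Repair are hypothetical names, it works on in-memory lines rather than files so it is easy to try out, and it assumes a single-character delimiter. It mirrors the question's logic, with one addition of my own that the question does not have: a running delimiter count, so the accumulated text never has to be rescanned.

```csharp
using System.Collections.Generic;
using System.Text;

public static class LineJoiner
{
    // Joins fragments of broken rows until they reach expectedCount delimiters.
    // Rows that overshoot are added to rejects, as in the question's code.
    public static List<string> Repair(IEnumerable<string> lines, char delimiter,
                                      int expectedCount, List<string> rejects)
    {
        var output = new List<string>();
        var pending = new StringBuilder(); // allocated once, outside the loop
        int pendingCount = 0;              // delimiters accumulated in pending

        foreach (string line in lines)
        {
            int count = 0;
            foreach (char c in line) if (c == delimiter) count++;

            if (count == expectedCount)
            {
                // Complete row: write it and discard any pending fragment
                // (the question's code does the same).
                output.Add(line);
                pending.Clear();
                pendingCount = 0;
                continue;
            }

            pending.Append(line);
            pendingCount += count;

            if (pendingCount == expectedCount)
            {
                output.Add(pending.ToString());
                pending.Clear();
                pendingCount = 0;
            }
            else if (pendingCount > expectedCount)
            {
                rejects.Add(pending.ToString());
                pending.Clear();
                pendingCount = 0;
            }
        }
        return output;
    }
}
```

In the real method, `output.Add(...)` would become `writer.WriteLine(...)` and the lines would come from the StreamReader.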