I have a large text file and need to insert a new line after every 2000 characters. This is what I have done so far:
string FilePath = Path.Combine(strFullProcessedPath, strFileName);
StreamReader reader = new StreamReader(FilePath);
string firstLine = reader.ReadLine();
if (firstLine.Length > 2000)
{
string text = File.ReadAllText(FilePath);
text = Regex.Replace(text, @"(.{2000})", "$1\r\n", RegexOptions.Multiline);
reader.Close();
File.WriteAllText(FilePath, text);
}
It throws an OutOfMemoryException.
Can anyone please give me some suggestions?
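Answer 0: (score: 1)
If the file is very large (multiple GB) and does not fit in memory, you can store the processed data in a temporary file. Avoid ReadAllText; instead, read and write with the help of a buffer (conveniently, 2000 characters in this context).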
// Initial and target file
string FilePath = Path.Combine(strFullProcessedPath, strFileName);
// Temporary file
string tempFile = Path.ChangeExtension(FilePath, ".~temp");
char[] buffer = new char[2000];
using (StreamReader reader = new StreamReader(FilePath)) {
bool first = true;
using (StreamWriter writer = new StreamWriter(tempFile)) {
while (true) {
int size = reader.ReadBlock(buffer, 0, buffer.Length);
if (size > 0) { // Do we have anything to write?
if (!first) // Are we in the middle and have to add a new line?
writer.WriteLine();
// Write only the characters actually read into the buffer
writer.Write(buffer, 0, size);
}
// The last (incomplete) chunk
if (size < buffer.Length)
break;
first = false;
}
}
}
// Remove the original so the temporary file can take its name
File.Delete(FilePath);
// Move the temporary file into the target one; after the move
// the temporary file no longer exists, so no further cleanup is needed
File.Move(tempFile, FilePath);
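Edit: Even if the file is not that large (300 MB, see comments), avoid string processing: several copies of the initial string can easily lead to an out-of-memory condition.
Something like this: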
private static IEnumerable<string> ToChunks(string text, int size) {
int n = text.Length / size + (text.Length % size == 0 ? 0 : 1);
for (int i = 0; i < n; ++i)
if (i == n - 1)
yield return text.Substring(i * size); // Last chunk
else
yield return text.Substring(i * size, size); // Inner chunk
}
...
string FilePath = Path.Combine(strFullProcessedPath, strFileName);
// Read once; do not Replace or otherwise transform the string
string text = File.ReadAllText(FilePath);
// ... but extracting 2000 char chunks
File.WriteAllLines(FilePath, ToChunks(text, 2000));
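If even a single in-memory copy of the text is too much, the same chunking idea can be applied to a reader instead of a string. The following is only a minimal sketch under that assumption; ChunkReader and ReadChunks are hypothetical names that do not appear in the answer above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class ChunkReader
{
    // Hypothetical helper: lazily yields pieces of at most `size` characters,
    // so only one chunk is held in memory at a time.
    public static IEnumerable<string> ReadChunks(TextReader reader, int size)
    {
        char[] buffer = new char[size];
        int count;
        // ReadBlock keeps reading until the buffer is full or the input ends
        while ((count = reader.ReadBlock(buffer, 0, size)) > 0)
            yield return new string(buffer, 0, count);
    }
}
```

Called with new StringReader("abcdefg") and a size of 3, it yields abc, def and g. The sequence can be passed to File.WriteAllLines, but the target should then be a temporary file rather than the file that is still being read.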
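Answer 1: (score: 0)
You cannot simply insert line breaks into an existing file; basically you need to rewrite the whole thing. The simplest way is to use two files, a source and a destination, and then delete and rename at the end (so that the temporary destination file takes the original name). This means you can loop over the source file without ever reading all of it into memory; essentially, as pseudocode: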
using(...open source for read...)
using(...create dest for write...)
{
char[] buffer = new char[2000];
int charCount;
while(TryBuffer(source, buffer, out charCount)) {
// if true, we filled the buffer; don't need to worry
// about charCount
Write(destination, buffer, buffer.Length);
Write(destination, CRLF);
}
if(charCount != 0) // final chunk when returned false
{
// write any remaining charCount chars as a final chunk
Write(destination, buffer, charCount);
}
}
That leaves the implementations of TryBuffer and Write. In this case, TextReader and TextWriter are probably your friends, since you are dealing with characters rather than bytes.
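One possible way to fill in the pseudocode is sketched below. It is not necessarily what the answer had in mind: the LineWrapper class, the Wrap and WrapFile methods, and the ".tmp" suffix are assumptions made for illustration, and Write is simply mapped onto TextWriter.Write.

```csharp
using System;
using System.IO;

static class LineWrapper
{
    // Returns true only when the buffer was filled completely;
    // charCount always reports how many characters were read.
    static bool TryBuffer(TextReader source, char[] buffer, out int charCount)
    {
        charCount = source.ReadBlock(buffer, 0, buffer.Length);
        return charCount == buffer.Length;
    }

    // Copies source to destination, adding CRLF after every full block.
    public static void Wrap(TextReader source, TextWriter destination, int width)
    {
        char[] buffer = new char[width];
        int charCount;
        while (TryBuffer(source, buffer, out charCount))
        {
            destination.Write(buffer, 0, buffer.Length);
            destination.Write("\r\n");
        }
        if (charCount != 0) // final, partially filled chunk
            destination.Write(buffer, 0, charCount);
    }

    // Rewrites the file through a temporary file (name is an assumption).
    public static void WrapFile(string path, int width)
    {
        string tempPath = path + ".tmp";
        using (var source = new StreamReader(path))
        using (var destination = new StreamWriter(tempPath))
        {
            Wrap(source, destination, width);
        }
        File.Delete(path);          // File.Move fails if the target exists
        File.Move(tempPath, path);
    }
}
```

For example, wrapping new StringReader("abcdefg") at a width of 3 into a StringWriter produces "abc\r\ndef\r\ng". Note that the sketch counts characters across any line breaks already in the file, exactly as the pseudocode does.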