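Read and process a large text file, inserting a new line after every 2000 characters

Time: 2018-02-20 07:55:47

Tags: c# .net large-files

I have a large text file that should get a new line after every 2000 characters. This is what I have done so far: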

string FilePath = Path.Combine(strFullProcessedPath, strFileName);
StreamReader reader = new StreamReader(FilePath);
string firstLine = reader.ReadLine();
if (firstLine.Length > 2000)
{
    string text = File.ReadAllText(FilePath);
    text = Regex.Replace(text, @"(.{2000})", "$1\r\n", RegexOptions.Multiline);
    reader.Close();
    File.WriteAllText(FilePath, text);
}

It throws an

OutOfMemoryException

Please, could someone give me some suggestions?

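2 answers:

Answer 0: (score: 1)

If the file is very large (multiple GB) and does not fit into memory, you can store the processed data in a temporary file. Avoid ReadAllText; instead, read and write with the help of a buffer (conveniently, 2000 characters in this context):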

  // Initial and target file
  string FilePath = Path.Combine(strFullProcessedPath, strFileName); 
  // Temporary file 
  string tempFile = Path.ChangeExtension(FilePath, ".~temp");        

  char[] buffer = new char[2000];

  using (StreamReader reader = new StreamReader(FilePath)) {
    bool first = true;

    using (StreamWriter writer = new StreamWriter(tempFile)) {
      while (true) {
        int size = reader.ReadBlock(buffer, 0, buffer.Length);

        if (size > 0) {  // Do we have anything to write?
          if (!first) // Are we in the middle and have to add a new line?
            writer.WriteLine();

          // Write only the characters that were actually read
          writer.Write(buffer, 0, size);
        }

        // The last (incomplete) chunk
        if (size < buffer.Length)
          break;

        first = false;
      }
    }
  }

  // Replace the original file with the processed temporary file;
  // after the move the temporary file no longer exists, so nothing is left to delete
  File.Delete(FilePath);
  File.Move(tempFile, FilePath);

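Edit: even if the file is not that large (300 MB, see comments), avoid string processing: several copies of the initial string can easily lead to an out-of-memory condition.

Something like this: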

private static IEnumerable<string> ToChunks(string text, int size) {
  int n = text.Length / size + (text.Length % size == 0 ? 0 : 1);

  for (int i = 0; i < n; ++i)
    if (i == n - 1)
      yield return text.Substring(i * size);       // Last chunk
    else
      yield return text.Substring(i * size, size); // Inner chunk  
}

...

string FilePath = Path.Combine(strFullProcessedPath, strFileName);

// Read once; do not Replace or otherwise transform the whole string
string text = File.ReadAllText(FilePath);

// ... but extracting 2000 char chunks
File.WriteAllLines(FilePath, ToChunks(text, 2000));
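
The same chunking idea can also be applied without loading the whole file first. The following is only a sketch, not part of the answer above: ChunkReader and ReadChunks are hypothetical names, and the helper relies on TextReader.ReadBlock to pull at most `size` characters at a time.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class ChunkReader
{
    // Hypothetical helper (an assumption, not from the original answer):
    // lazily yields consecutive chunks of at most `size` characters.
    public static IEnumerable<string> ReadChunks(TextReader reader, int size)
    {
        char[] buffer = new char[size];
        int count;

        // ReadBlock keeps reading until the buffer is full or the input ends,
        // so only the last chunk can be shorter than `size`.
        while ((count = reader.ReadBlock(buffer, 0, buffer.Length)) > 0)
            yield return new string(buffer, 0, count);
    }
}
```

Because the source is still open while the chunks are being produced, the output would have to go to a different file, for example File.WriteAllLines(tempFile, ChunkReader.ReadChunks(reader, 2000)) with a temporary file as in the first snippet, followed by the same delete and move.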

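Answer 1: (score: 0)

You cannot simply insert line breaks into an existing file; essentially you need to rewrite the whole thing. The simplest way is to use two files, a source and a destination, and then delete and rename at the end (so the temporary destination file takes the original name). This means you can loop over the source file without ever reading it all into memory; essentially, as pseudocode: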

using(...open source for read...)
using(...create dest for write...)
{
    char[] buffer = new char[2000];
    int charCount;
    while(TryBuffer(source, buffer, out charCount)) {
        // if true, we filled the buffer; don't need to worry
        // about charCount
        Write(destination, buffer, buffer.Length);
        Write(destination, CRLF);
    }
    if(charCount != 0) // final chunk when returned false
    {
        // write any remaining charCount chars as a final chunk
        Write(destination, buffer, charCount);
    }
}

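That just leaves the implementation of TryBuffer and Write. In this case, TextReader and TextWriter are probably your friends, since you are dealing with characters rather than bytes.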
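
One possible way to fill in TryBuffer and Write from the pseudocode is sketched below. This is an assumption about what the helpers might look like, not the answerer's own code; the class name ChunkCopy and the wrapper method InsertLineBreaks are hypothetical.

```csharp
using System;
using System.IO;

public static class ChunkCopy
{
    // Hypothetical TryBuffer: returns true only when the buffer was filled
    // completely; charCount always holds the number of characters read.
    public static bool TryBuffer(TextReader source, char[] buffer, out int charCount)
    {
        charCount = source.ReadBlock(buffer, 0, buffer.Length);
        return charCount == buffer.Length;
    }

    // Hypothetical Write: writes the first `count` characters of the buffer.
    public static void Write(TextWriter destination, char[] buffer, int count)
    {
        destination.Write(buffer, 0, count);
    }

    // The loop from the pseudocode, with the two helpers plugged in.
    public static void InsertLineBreaks(TextReader source, TextWriter destination, int chunkSize)
    {
        char[] buffer = new char[chunkSize];
        int charCount;

        while (TryBuffer(source, buffer, out charCount))
        {
            Write(destination, buffer, buffer.Length);
            destination.Write("\r\n"); // CRLF after every full chunk
        }

        if (charCount != 0) // final, partially filled chunk
            Write(destination, buffer, charCount);
    }
}
```

Like the pseudocode, this sketch writes a CRLF after every full chunk, so an input whose length is an exact multiple of the chunk size ends with a trailing line break. In real use, source would be a StreamReader over the original file and destination a StreamWriter over the temporary file.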