Question

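I have been assigned a project that requires a C# console application for manipulating text files. The text files are bcp table dumps. The program should be able to:

Split the file based on a column provided by the user
Include or exclude the split column in the output

Currently, I am reading the file like this: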

var groupQuery = from name in File.ReadAllLines(fileName).Skip(skipHeaderRow)
                 let n = name.Split(delimiterChars)
                 group name by n[index] into g
                 // orderby g.Key
                 select g;

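I am worried that I might run into memory problems, because some files may contain more than 2 million records, with each line about 2617 bytes long.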

Answer 1

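If you are sure your program only needs sequential access to the bcp dump file, use the StreamReader class to read it. This class is optimized for sequential access and opens the file as a stream, so memory should not be a problem. You can also increase the stream's buffer size by using one of the class's other constructors, so that larger chunks are processed in memory at a time.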
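As a minimal sketch of the sequential approach, the helper below reads the dump one line at a time through a StreamReader built with an explicit buffer size. The class and method names (BcpReader, ReadLinesBuffered), the 1 MB default buffer, and the UTF-8 encoding are assumptions for illustration; a real bcp dump may use a different code page.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class BcpReader
{
    // Hypothetical helper: lazily yields one line at a time, so only the
    // current line and the reader's buffer are held in memory.
    public static IEnumerable<string> ReadLinesBuffered(string path, int bufferSize = 1 << 20)
    {
        // StreamReader(string, Encoding, bool, int) lets the caller pick the buffer size.
        // UTF-8 is an assumption here; use the encoding the dump was written with.
        using (var reader = new StreamReader(path, Encoding.UTF8, true, bufferSize))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                yield return line;
            }
        }
    }
}
```

A caller would consume it with foreach (var line in BcpReader.ReadLinesBuffered(fileName)) and process each line before the next one is read.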

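If you need random access to your file, go for Memory Mapped Files. Make sure you create the view accessor over a limited portion of the file. The sample code given in the MMF link explains how to create small views over a large file.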
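A rough sketch of a limited view, assuming read-only access is enough: the helper maps only the requested window of the file instead of the whole dump. The names BcpWindow and ReadWindow are hypothetical, and the caller is assumed to pass an offset and count that lie within the file.

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

public static class BcpWindow
{
    // Hypothetical helper: copies `count` bytes starting at `offset`
    // through a view that covers only that window of the file.
    public static byte[] ReadWindow(string path, long offset, int count)
    {
        using (var mmf = MemoryMappedFile.CreateFromFile(
                   path, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
        using (var view = mmf.CreateViewAccessor(offset, count, MemoryMappedFileAccess.Read))
        {
            var buffer = new byte[count];
            view.ReadArray(0, buffer, 0, count); // position 0 is relative to the view
            return buffer;
        }
    }
}
```

Because the rows are variable-length text, a window will usually start and end in the middle of a line, so the caller still has to locate line boundaries inside the returned bytes.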

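Edit: I originally had MMF code in this answer, but I removed it after realizing that although group by is lazy (deferred), it is also a non-streaming LINQ operator. It therefore has to read your entire bcp dump before it can give you any result. This implies:

StreamReader is clearly the better approach. Make sure you increase its buffer as far as is practical;
Your LINQ query will spend some time at the group by operator and will only come back to life once the whole file has been read.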
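The deferred but non-streaming behavior described above can be observed with a small experiment. This is only an illustrative sketch against LINQ to Objects; GroupByDemo, Measure, and the counting source are hypothetical names, and integers stand in for the lines of the dump.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class GroupByDemo
{
    // Yields 0..total-1 and records in counter[0] how many items were pulled.
    private static IEnumerable<int> CountingSource(int total, int[] counter)
    {
        for (int i = 0; i < total; i++)
        {
            counter[0]++;
            yield return i;
        }
    }

    // Returns { items consumed after building the query,
    //           items consumed after requesting the first group }.
    public static int[] Measure(int total)
    {
        var counter = new int[1];
        var groups = CountingSource(total, counter).GroupBy(i => i % 3);
        int afterBuild = counter[0];      // deferred: nothing has been read yet

        using (var e = groups.GetEnumerator())
        {
            e.MoveNext();                 // asking for the first group drains the source
        }
        return new[] { afterBuild, counter[0] };
    }
}
```

Measure(10) reports that no items are read when the query is built and that all 10 are read before the first group is available, which is why grouping 2 million rows this way buffers the whole file.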

Answer 2

Try using buffered streams to read and write the files without loading them completely into memory.

using (FileStream fs = File.Open(inputFile, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
using (StreamReader sr = new StreamReader(fs))
{
    string line;
    while ((line = sr.ReadLine()) != null)
    {
        // Split the line here into lineA and lineB,
        // then write them out using a buffered writer.
    }
}

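(Taken from a linked source.)

The idea is to read the file line by line rather than loading the entire content into memory, split each line the way you want, and then write the split lines to the output files, again line by line.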
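Applied to the question's requirements, the line-by-line idea might look like the sketch below, which routes each row to an output file named after the value of the split column. Everything here is an assumption for illustration: the names BcpSplitter and SplitByColumn, the single-character delimiter, the ".txt" output naming, and the premise that the split values are valid file names and few enough to keep one writer open per value.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class BcpSplitter
{
    // Hypothetical helper: streams the input once and appends each row
    // to the output file that matches its split-column value.
    public static void SplitByColumn(string inputFile, string outputDir, char delimiter,
                                     int index, bool includeSplitColumn, int skipHeaderRows = 0)
    {
        var writers = new Dictionary<string, StreamWriter>();
        try
        {
            int lineNumber = 0;
            foreach (string line in File.ReadLines(inputFile)) // lazy, one line at a time
            {
                if (lineNumber++ < skipHeaderRows) continue;

                string[] fields = line.Split(delimiter);
                string key = fields[index];

                StreamWriter writer;
                if (!writers.TryGetValue(key, out writer))
                {
                    // Assumes the key is usable as a file name.
                    writer = new StreamWriter(Path.Combine(outputDir, key + ".txt"));
                    writers[key] = writer;
                }

                if (includeSplitColumn)
                {
                    writer.WriteLine(line);
                }
                else
                {
                    var kept = new List<string>(fields);
                    kept.RemoveAt(index); // drop the split column from the output row
                    writer.WriteLine(string.Join(delimiter.ToString(), kept));
                }
            }
        }
        finally
        {
            foreach (var w in writers.Values) w.Dispose(); // flushes and closes every output
        }
    }
}
```

Memory use then depends on the number of distinct split values rather than on the number of rows. If the column has very many distinct values, keeping a writer open for each one may exhaust file handles, and a different strategy would be needed.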

Answer 3

Don't reinvent the wheel. Consider using a library like FileHelpers.

http://www.filehelpers.net/example/QuickStart/ReadWriteRecordByRecord/

var engine = new FileHelperAsyncEngine<Customer>();

using(engine.BeginReadFile(fileName))
{
    var groupQuery =
        from o in engine
        group o by o.CustomerId into g
        // orderby g.Key
        select g;

    foreach(Customer cust in engine)
    {
        Console.WriteLine(cust.Name);
    }
}

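Your group by and order by operations will still run into memory problems, because all records need to be in memory to be grouped and sorted.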
