How do I remove duplicate lines from a 1 GB text file with more than 200K lines?

Asked: 2013-06-14 15:06:45

Tags: c# file csv file-io

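I'm currently using the code below. I have only run it on a 300-line text file, and even then it takes 2 minutes to execute. My real text file has more than 200K lines (rows), so this code won't work for it. Can anyone help me solve this? Thanks in advance.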

string[] source = System.IO.File.ReadAllLines(@"C:\Documents and Settings\finaloutput.txt");      

var q1 = (from line in source
          let fields = line.Split(',')
          select new
          {
              autoid = fields[0],
              ATMID = fields[4],
              DATE = fields[2],
              TIME = fields[3],
              CARDNo = fields[5],
              TRANSId = fields[6],
              SEQNo = fields[7],
              TRANSIT = fields[8],
              CheckNo = fields[9],
              CATEGORY = fields[10],
              SCORE = fields[11],
              //THRESHOLD = fields[12]
          });


    var ids = (from d in q1
               where d.CATEGORY != "Accepted"
               group d by new { d.ATMID, d.DATE, d.CARDNo, d.TRANSIT, d.CheckNo } into grp
               select grp.Min(x => x.autoid));


    var toDelete = (from d in q1
                    where !ids.Contains(d.autoid) && d.CATEGORY != "Accepted"
                    select d.autoid);

    // source1.DeleteOnSubmit(toDelete);

    var distinct = (from d in q1
                    where !toDelete.Contains(d.autoid)
                    select d);



    // Makes a list of the DeletedFields  
    // var list_Of_CSV_ItemsDeleted = distinct.Select(x => string.Join(",", x.autoid));

    // Makes a list of the distinct Fields  
    var list_Of_CSV_ItemsDistinct = distinct.Select(x => string.Join(",", x.autoid, x.ATMID, x.DATE, x.TIME, x.CARDNo, x.TRANSId, x.SEQNo, x.TRANSIT, x.CheckNo, x.CATEGORY, x.SCORE)); 
    System.IO.File.WriteAllLines(@"C:\Documents and Settings\distict1.txt", list_Of_CSV_ItemsDistinct);

1 Answer:

Answer 0 (score: 1)

I'm not going to rewrite this for you, but one thing you need to do is take advantage of deferred execution. Consider the following code:

var enumerable = File.ReadLines(filePath);

This returns an IEnumerable<string>, so it only reads a line from the file when you ask for one. Now consider this code:

var next100 = enumerable.Take(100);

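This takes the first 100 lines and lets you work with them. That is what you will have to do. You can use pretty much the same LINQ queries, but you will have to work on one section at a time.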
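Note that Take(100) on its own only ever yields the first 100 lines of a sequence, so working through the whole file "one section at a time" means splitting a single enumeration into consecutive batches. The following is a minimal sketch of one way to do that. BatchingSketch and Batch are hypothetical names introduced here for illustration; they are not part of the framework or of the original answer.

```csharp
using System;
using System.Collections.Generic;

public static class BatchingSketch
{
    // Hypothetical helper: splits any sequence into consecutive lists of at
    // most `size` items, enumerating the source exactly once.
    public static IEnumerable<List<T>> Batch<T>(IEnumerable<T> source, int size)
    {
        // Validate eagerly, so bad arguments fail at the call site rather
        // than on the first MoveNext of the lazy iterator below.
        if (source == null) throw new ArgumentNullException(nameof(source));
        if (size <= 0) throw new ArgumentOutOfRangeException(nameof(size));
        return BatchIterator(source, size);
    }

    private static IEnumerable<List<T>> BatchIterator<T>(IEnumerable<T> source, int size)
    {
        var batch = new List<T>(size);
        foreach (var item in source)
        {
            batch.Add(item);
            if (batch.Count == size)
            {
                // Hand the full batch to the caller, then start a fresh list
                // so the caller's batch is never modified afterwards.
                yield return batch;
                batch = new List<T>(size);
            }
        }

        // Emit the final, possibly shorter, batch.
        if (batch.Count > 0) yield return batch;
    }
}
```

Combined with the lazy reader above, a call such as `foreach (var batch in BatchingSketch.Batch(File.ReadLines(filePath), 100)) { ... }` would keep at most 100 lines in memory at a time. Whether batching alone is enough depends on the query: a group whose rows fall into different batches would still need some state carried across batches.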

So, instead of something like this:

var q1 = (from line in source ...

it would probably have to be something like this:

var q1 = (from line in source.Take(100) ...
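Because a group of duplicates can span several sections of the file, one possible alternative is to stream the file twice and keep only a small lookup in memory. This is a different technique from the Take-based batching described above, shown only as a sketch. It assumes that autoid values are unique, that every line has at least 12 comma-separated fields (as the question's query also requires), and that the input and output paths differ. TwoPassDedupSketch, GroupKey and Deduplicate are hypothetical names.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class TwoPassDedupSketch
{
    // Hypothetical helper: builds a key from the columns the question groups
    // by (ATMID, DATE, CARDNo, TRANSIT, CheckNo). The separator '\u001F' is
    // assumed never to appear in the data.
    private static string GroupKey(string[] f)
    {
        return string.Join("\u001F", f[4], f[2], f[5], f[8], f[9]);
    }

    public static void Deduplicate(string inputPath, string outputPath)
    {
        // Same comparison that Min() applies to strings in the question's
        // query: text order, not numeric order.
        var comparer = Comparer<string>.Default;
        var minIdByGroup = new Dictionary<string, string>();

        // Pass 1: remember the smallest autoid of every non-"Accepted" group.
        foreach (var line in File.ReadLines(inputPath))
        {
            var f = line.Split(',');
            if (f[10] == "Accepted") continue;

            var key = GroupKey(f);
            string current;
            if (!minIdByGroup.TryGetValue(key, out current) ||
                comparer.Compare(f[0], current) < 0)
            {
                minIdByGroup[key] = f[0];
            }
        }

        // Pass 2: keep every "Accepted" row plus the row holding the smallest
        // autoid of each group, projected to the same column order the
        // question writes out.
        var kept = File.ReadLines(inputPath)
            .Select(line => line.Split(','))
            .Where(f => f[10] == "Accepted" || minIdByGroup[GroupKey(f)] == f[0])
            .Select(f => string.Join(",", f[0], f[4], f[2], f[3], f[5],
                                     f[6], f[7], f[8], f[9], f[10], f[11]));

        // WriteAllLines consumes the lazy sequence one line at a time.
        File.WriteAllLines(outputPath, kept);
    }
}
```

Each dictionary lookup takes roughly constant time, so the work grows linearly with the number of lines. The query in the question, by contrast, re-evaluates a deferred sequence inside every Contains call. Memory use is one dictionary entry per distinct group, not the whole file.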