Merging CSV rows in a huge file

Time: 2015-07-09 13:26:38

Tags: c# regex node.js csv sed

I have a CSV that looks like this:

783582893T,2014-01-01 00:00,0,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582893T,2014-01-01 00:15,1,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582893T,2014-01-01 00:30,2,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582855T,2014-01-01 00:00,0,128,35.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582855T,2014-01-01 00:15,1,128,35.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582855T,2014-01-01 00:30,2,128,35.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
...
783582893T,2014-01-02 00:00,0,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582893T,2014-01-02 00:15,1,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582893T,2014-01-02 00:30,2,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y

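The real file has 5 billion records, though. Looking at the first column and the date part of the second column, you can see that every three records are "grouped" together: they are the 15-minute breakdown of the first 30 minutes of that day.

I want the output to look like this: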

783582893T,2014-01-01 00:00,0,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582855T,2014-01-01 00:00,0,128,35.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
...
783582893T,2014-01-02 00:00,0,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y

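That is, the first 4 columns of each repeated row are omitted, and the remaining columns are appended to the first record of the same group. Basically, I am converting the file from one row per 15 minutes to one row per day.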
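To make the intended transformation concrete, here is a minimal sketch in Node.js of merging a single group. The name `mergeGroup`, its `skip` parameter, and the shortened rows are made up for illustration. Note that the text says 4 leading columns are dropped, while the sample output above appears to drop 5 (the appended part starts at `40.0`), so the count is left as a parameter.

```javascript
// Merge one group of CSV rows: keep the first row whole and append the
// remaining rows with their first `skip` columns removed.
// `mergeGroup` and `skip` are hypothetical names for this illustration.
function mergeGroup(rows, skip = 4) {
  const [head, ...rest] = rows;
  const tails = rest.map((row) => row.split(',').slice(skip).join(','));
  return [head, ...tails].join(',');
}

// Shortened, made-up rows in the same shape as the question's data.
const group = [
  'A1,2014-01-01 00:00,0,124,29.1,40.0,Y',
  'A1,2014-01-01 00:15,1,124,29.1,41.0,Y',
  'A1,2014-01-01 00:30,2,124,29.1,42.0,Y',
];

console.log(mergeGroup(group));
// A1,2014-01-01 00:00,0,124,29.1,40.0,Y,29.1,41.0,Y,29.1,42.0,Y
console.log(mergeGroup(group, 5));
// A1,2014-01-01 00:00,0,124,29.1,40.0,Y,41.0,Y,42.0,Y
```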

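Since I will be processing 5 billion records, I think the best approach is regex (with EmEditor) or some tool built for this kind of job (multithreaded, optimized) rather than a custom-programmed solution. That said, I am open to ideas in nodeJS or C# that are relatively simple and very fast.

How can this be done?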

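3 Answers:

Answer 0: (score: 2)

If there is always a fixed number of records per group and they are in order, it is fairly easy to read a few lines at a time, parse them, and write them out. Running a regex over billions of records would take forever. StreamReader and StreamWriter can handle files this large because they read and write one line at a time.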

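using System.Collections.Generic;
using System.IO;
using System.Linq;

using (var sr = new StreamReader("inputFile.txt"))
using (var sw = new StreamWriter("outputFile.txt"))
{
    const int lineCountToGroup = 3; // change to 96 for a full day of 15-minute rows
    string first;
    while ((first = sr.ReadLine()) != null)
    {
        var lines = new List<string> { first };

        // Read the rest of the group; stop early if the file ends mid-group.
        string next;
        while (lines.Count < lineCountToGroup && (next = sr.ReadLine()) != null)
            lines.Add(next);

        // Grouping logic from the question: keep the first row whole and
        // drop the first 4 columns of every following row. Adjust as needed.
        var tails = lines.Skip(1).Select(l => string.Join(",", l.Split(',').Skip(4)));
        sw.WriteLine(string.Join(",", new[] { lines[0] }.Concat(tails)));
    }
}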

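Disclaimer: untested code with no error handling, and it assumes every group really does have the same number of rows. You will obviously need to adapt it to your specific situation.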
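Since the question also mentions Node.js, the fixed-count idea can be sketched as a small generator. This is only an illustration under the same assumption as above (rows are ordered and every group has the same size); `chunkLines` is a hypothetical helper name, not part of any library.

```javascript
// Yield consecutive groups of `size` lines from any iterable of lines.
// A short final group is yielded as-is rather than padded or dropped.
function* chunkLines(lines, size) {
  let group = [];
  for (const line of lines) {
    group.push(line);
    if (group.length === size) {
      yield group;
      group = [];
    }
  }
  if (group.length > 0) yield group;
}

// Seven placeholder "lines" grouped three at a time.
const demo = [...chunkLines(['a', 'b', 'c', 'd', 'e', 'f', 'g'], 3)];
console.log(demo); // [ [ 'a', 'b', 'c' ], [ 'd', 'e', 'f' ], [ 'g' ] ]
```

Because it only ever holds one group in memory, the same generator could be fed from a line-by-line file reader instead of an array.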

Answer 1: (score: 1)

You can do something like this (untested code without any error handling, but it should give you the general gist):

using System.IO;
using System.Linq;
using System.Text;

using (var sin = new StreamReader("yourfile.csv"))
using (var sout = new StreamWriter("outfile.csv"))
{
    var line = sin.ReadLine();
    if (line != null)                  // nothing to do for an empty file
    {
        var cells = line.Split(',');   // note: you should probably check the length too!
        var key = cells[0];            // use this to match other rows
        var output = new StringBuilder(line);   // this is the output line we build
        while ((line = sin.ReadLine()) != null) // while we have more lines
        {
            cells = line.Split(',');   // split so we can get the first column
            if (cells[0] == key)       // if the first column matches the current key
            {
                // add this row, minus its first 4 columns, to our output line
                output.Append(',').Append(string.Join(",", cells.Skip(4)));
                continue;
            }
            // once the key changes
            sout.WriteLine(output.ToString());  // write out the line we've built up
            output.Clear();
            output.Append(line);       // start building the new line
            key = cells[0];            // and update the key
        }
        // once all lines have been processed
        sout.WriteLine(output.ToString());      // we still have the last line to write out
    }
}

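The idea is to walk through the lines in order and keep track of the current value of the first column. When that value changes, you write out the output line you have been building and update key. That way you don't have to worry about exactly how many matching rows there are, or whether a few data points are missing.

Note that if you are concatenating 96 rows, using a StringBuilder for output rather than a String is probably more efficient.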
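A Node.js version of the same key-change approach might look like the sketch below. It is an illustration, not a tuned implementation. The names `keyOf`, `createMerger` and `mergeFile` are hypothetical. Unlike the C# above, the key here combines the ID with the date part of the timestamp, on the assumption that timestamps always look like `YYYY-MM-DD HH:MM`; that keeps two consecutive days of the same ID from being merged into one row.

```javascript
const fs = require('fs');
const readline = require('readline');
const { once } = require('events');

// Group key: ID plus the date part of the timestamp ("YYYY-MM-DD").
function keyOf(cells) {
  return cells[0] + ',' + (cells[1] || '').slice(0, 10);
}

// Returns { push, flush }; `emit` is called once per merged row.
function createMerger(emit, skip = 4) {
  let key = null;
  let parts = [];
  const flush = () => {
    if (parts.length > 0) emit(parts.join(','));
    parts = [];
  };
  const push = (line) => {
    const cells = line.split(',');
    const k = keyOf(cells);
    if (k !== key) {
      flush();            // key changed: write out the finished row
      key = k;
      parts.push(line);   // the first row of a group is kept whole
    } else {
      parts.push(cells.slice(skip).join(','));
    }
  };
  return { push, flush };
}

// Stream the input line by line and wait for 'drain' when the output
// buffer is full, so memory use stays bounded on very large files.
async function mergeFile(inPath, outPath) {
  const out = fs.createWriteStream(outPath);
  let needDrain = false;
  const merger = createMerger((row) => {
    if (!out.write(row + '\n')) needDrain = true;
  });
  const rl = readline.createInterface({
    input: fs.createReadStream(inPath),
    crlfDelay: Infinity,
  });
  for await (const line of rl) {
    if (line !== '') merger.push(line);
    if (needDrain) {
      await once(out, 'drain');
      needDrain = false;
    }
  }
  merger.flush();         // the last group is still pending at end of file
  await new Promise((resolve) => out.end(resolve));
}
```

Calling `mergeFile('yourfile.csv', 'outfile.csv')` would process the file in one pass. Keeping `createMerger` separate from the file handling makes the grouping logic easy to check on a handful of lines before running it on the full data set.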

Answer 2: (score: 0)

Define ProcessOutputLine to store the merged row. Call ProcessInputLine after each ReadLine and once more at the end of the file.

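string curKey = "";
int keyLength = ...;   // set to the total length of the first 4 columns
                       // (assumed to exclude the trailing comma, so appended text starts with ',')
string outputLine = "";

private void ProcessInputLine(string line)
{
    // Guard against lines shorter than the key, e.g. an empty line passed at end of file.
    bool hasKey = line.Length >= keyLength;
    string newKey = hasKey ? line.Substring(0, keyLength) : line;
    if (hasKey && newKey == curKey)
    {
        outputLine += line.Substring(keyLength);   // same group: append everything after the key
    }
    else
    {
        if (outputLine != "") ProcessOutputLine(outputLine);   // key changed: emit the finished row
        curKey = newKey;
        outputLine = line;
    }
}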
Edit: this solution is very similar to Matt Burland's; the only noteworthy difference is that I don't use the Split function.
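One caveat with a fixed-length prefix: in the sample data, the time and index columns differ between the rows of a group and can vary in width, so a prefix covering all 4 columns would not match from one row to the next. A possible split-free variation, sketched here in Node.js, uses one prefix for the key and a separate comma search for the part to append. The helper names are hypothetical, and the key assumes the timestamp format shown in the question.

```javascript
// Index just past the n-th comma in `line`, or -1 if there are fewer than n.
// Callers should check for -1 before slicing.
function afterNthComma(line, n) {
  let pos = -1;
  for (let i = 0; i < n; i++) {
    pos = line.indexOf(',', pos + 1);
    if (pos === -1) return -1;
  }
  return pos + 1;
}

// Group key without splitting: everything before the first space,
// i.e. "ID,YYYY-MM-DD" for timestamps like "2014-01-01 00:15".
function prefixKey(line) {
  const space = line.indexOf(' ');
  return space === -1 ? line : line.slice(0, space);
}

// Shortened row in the same shape as the question's data.
const line = '783582893T,2014-01-01 00:15,1,124,29.1,40.0,Y';
console.log(prefixKey(line));                    // 783582893T,2014-01-01
console.log(line.slice(afterNthComma(line, 4))); // 29.1,40.0,Y
```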