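Batching with multiple GroupBy

Time: 2011-06-02 18:47:42

Tags: c# linq linq-to-objects

I have a CSV file containing records that need to be sorted and then grouped into batches of an arbitrary size (for example, at most 300 records per batch). A batch may contain fewer than 300 records, because the contents of each batch must be homogeneous (based on the contents of several different columns).

My LINQ statement was inspired by this answer on batching with LINQ, and looks like this: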

var query = (from line in EbrRecords
            let EbrData = line.Split('\t')
            let Location = EbrData[7]
            let RepName = EbrData[4]
            let AccountID = EbrData[0]
            orderby Location, RepName, AccountID).
            Select((data, index) => new {
                Record = new EbrRecord(
                AccountID = EbrData[0],
                AccountName = EbrData[1],
                MBSegment = EbrData[2],
                RepName = EbrData[4],
                Location = EbrData[7],
                TsrLocation = EbrData[8]
                )
                ,
                Index = index}
                ).GroupBy(x => new {x.Record.Location, x.Record.RepName, batch = x.Index / 100});    

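The "/ 100" gives me the arbitrary bucket size. The other elements of the GroupBy are meant to keep each batch homogeneous. I suspect this is almost what I want, but it gives me the following compiler error: A query body must end with a select clause or a group clause. I understand why I am getting the error, but overall I don't know how to fix this query. How can it be done?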

Update: I have almost reached what I am after, as follows:

List<EbrRecord> input = new List<EbrRecord> {
    new EbrRecord {Name = "Brent",Age = 20,ID = "A"},
    new EbrRecord {Name = "Amy",Age = 20,ID = "B"},
    new EbrRecord {Name = "Gabe",Age = 23,ID = "B"},
    new EbrRecord {Name = "Noah",Age = 27,ID = "B"},
    new EbrRecord {Name = "Alex",Age = 27,ID = "B"},
    new EbrRecord {Name = "Stormi",Age = 27,ID = "B"},
    new EbrRecord {Name = "Roger",Age = 27,ID = "B"},
    new EbrRecord {Name = "Jen",Age = 27,ID = "B"},
    new EbrRecord {Name = "Adrian",Age = 28,ID = "B"},
    new EbrRecord {Name = "Cory",Age = 29,ID = "C"},
    new EbrRecord {Name = "Bob",Age = 29,ID = "C"},
    new EbrRecord {Name = "George",Age = 29,ID = "C"},
    };

//look how tiny this query is, and it is very nearly the result I want!!!
int i = 0;
var result = from q in input
                orderby q.Age, q.ID
                group q by new { q.ID, batch = i++ / 3 };

foreach (var agroup in result)
{
    Debug.WriteLine("ID:" + agroup.Key);
    foreach (var record in agroup)
    {
        Debug.WriteLine(" Name:" + record.Name);
    }
}

The trick here is to sidestep the Select "index position" overload by using a closure variable (int i in this case). The output is as follows:

ID:{ ID = A, batch = 0 }
 Name:Brent
ID:{ ID = B, batch = 0 }
 Name:Amy
 Name:Gabe
ID:{ ID = B, batch = 1 }
 Name:Noah
 Name:Alex
 Name:Stormi
ID:{ ID = B, batch = 2 }
 Name:Roger
 Name:Jen
 Name:Adrian
ID:{ ID = C, batch = 3 }
 Name:Cory
 Name:Bob
 Name:George

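While this result is acceptable, it is slightly off from the ideal one. The first occurrence of "batch B" should contain 3 people (Amy, Gabe, Noah), not two (Amy, Gabe). This happens because the index position is not reset when a new group is identified. Does anyone know how to reset my custom index position for each group?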

Update 2: I think I may have found the answer. First, create an additional function like this:

    public static bool BatchGroup(string ID, ref string priorID )
    {
        if (priorID != ID)
        {
            priorID = ID;
            return true;
        }
        return false;
    }

Second, update the LINQ query like this:

int i = 0;
string priorID = null;
var result = from q in input
                orderby q.Age, q.ID
                group q by new { q.ID, batch = (BatchGroup(q.ID, ref priorID) ? i=0 : ++i) / 3 };

Now it does what I want. I just wish I didn't need that separate function!

2 Answers:

Answer 0: (score: 2)

Does this work?

var query = (from line in EbrRecords
        let EbrData = line.Split('\t')
        let Location = EbrData[7]
        let RepName = EbrData[4]
        let AccountID = EbrData[0]
        orderby Location, RepName, AccountID
        select new EbrRecord
        {
                AccountID = EbrData[0],
                AccountName = EbrData[1],
                MBSegment = EbrData[2],
                RepName = EbrData[4],
                Location = EbrData[7],
                TsrLocation = EbrData[8]
        }
        ).Select((data, index) => new
        {
            Record = data,
            Index = index
        })
        .GroupBy(x => new {x.Record.Location, x.Record.RepName, batch = x.Index / 100},
            x => x.Record);

Answer 1: (score: 1)

orderby Location, RepName, AccountID

As noted above, there needs to be a select clause, as shown in StriplingWarrior's answer. A LINQ comprehension query must end with select or group by.


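Unfortunately, there is a logic flaw. Suppose I have 50 accounts in the first group and 100 accounts in the second group, with a batch size of 100. Because the index runs across the whole sequence rather than restarting for each group, the original code will produce 3 batches of size 50 instead of 2 batches of sizes 50 and 100.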
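The flaw can be reproduced with a minimal sketch. The class name, the method name and the "A"/"B" group labels below are hypothetical stand-ins, not part of the original code; only the batching expression (global index divided by the batch size) mirrors the original query.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class GlobalIndexFlawDemo
{
    // Returns the size of each batch when the batch number is derived
    // from the index over the whole sequence (the original approach).
    public static List<int> BatchSizes(int firstGroup, int secondGroup, int batchSize)
    {
        // Hypothetical data: group "A" followed by group "B".
        var records = Enumerable.Repeat("A", firstGroup)
            .Concat(Enumerable.Repeat("B", secondGroup));

        return records
            .Select((key, index) => new { Key = key, Index = index })
            .GroupBy(x => new { x.Key, Batch = x.Index / batchSize })
            .Select(g => g.Count())
            .ToList();
    }

    static void Main()
    {
        // Group "B" starts at index 50, so it is split at index 100.
        Console.WriteLine(string.Join(", ", BatchSizes(50, 100, 100))); // 50, 50, 50
    }
}
```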

Here is one way to fix it.

IEnumerable<IGrouping<int, EbrRecord>> query = ...

  orderby Location, RepName, AccountID
  select new EbrRecord
  {
    AccountID = EbrData[0],
    AccountName = EbrData[1],
    MBSegment = EbrData[2],
    RepName = EbrData[4],
    Location = EbrData[7],
    TsrLocation = EbrData[8]
  } into x
  group x by new {Location = x.Location, RepName = x.RepName} into g
  from g2 in g.Select((data, index) => new { Record = data, Index = index })
              .GroupBy(y => y.Index/100, y => y.Record)
  select g2;


List<List<EbrRecord>> result = query.Select(g => g.ToList()).ToList();
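For reference, the same idea (group first, then number the records inside each group) can be tried against the sample data from the question's update. This is only a sketch in method syntax: the EbrRecord class here is a stand-in with the Name, Age and ID properties used in that sample, and the PerGroupBatching class with its Batch and SampleInput methods is a hypothetical name introduced for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-in record type mirroring the sample data in the question's update.
class EbrRecord
{
    public string Name { get; set; } = "";
    public int Age { get; set; }
    public string ID { get; set; } = "";
}

static class PerGroupBatching
{
    // Groups by ID first, then indexes the records within each group,
    // so the batch number restarts at zero for every ID.
    public static List<List<EbrRecord>> Batch(IEnumerable<EbrRecord> input, int batchSize)
    {
        return input
            .OrderBy(r => r.Age).ThenBy(r => r.ID)
            .GroupBy(r => r.ID)
            .SelectMany(g => g
                .Select((record, index) => new { Record = record, Index = index })
                .GroupBy(x => x.Index / batchSize, x => x.Record))
            .Select(batch => batch.ToList())
            .ToList();
    }

    public static List<EbrRecord> SampleInput()
    {
        return new List<EbrRecord> {
            new EbrRecord {Name = "Brent", Age = 20, ID = "A"},
            new EbrRecord {Name = "Amy", Age = 20, ID = "B"},
            new EbrRecord {Name = "Gabe", Age = 23, ID = "B"},
            new EbrRecord {Name = "Noah", Age = 27, ID = "B"},
            new EbrRecord {Name = "Alex", Age = 27, ID = "B"},
            new EbrRecord {Name = "Stormi", Age = 27, ID = "B"},
            new EbrRecord {Name = "Roger", Age = 27, ID = "B"},
            new EbrRecord {Name = "Jen", Age = 27, ID = "B"},
            new EbrRecord {Name = "Adrian", Age = 28, ID = "B"},
            new EbrRecord {Name = "Cory", Age = 29, ID = "C"},
            new EbrRecord {Name = "Bob", Age = 29, ID = "C"},
            new EbrRecord {Name = "George", Age = 29, ID = "C"},
        };
    }

    static void Main()
    {
        foreach (var batch in Batch(SampleInput(), 3))
        {
            Console.WriteLine(batch[0].ID + ": " + string.Join(", ", batch.Select(r => r.Name)));
        }
    }
}
```

With a batch size of 3, the first "B" batch now holds Amy, Gabe and Noah, and the last "B" batch holds the two leftover records (Jen, Adrian).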

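Also note that batching with GroupBy is quite slow because of the redundant iteration. You could write a for loop that makes a single pass over the ordered set, and that loop would run much faster than LINQ to Objects.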
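A single-pass loop along those lines might look like the sketch below. It is an illustration under assumptions, not the answerer's code: the SinglePassBatcher class, the Batch method and its parameters are hypothetical names, it uses foreach rather than an indexed for loop, and it assumes the input is already sorted so that records with equal keys are adjacent.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class SinglePassBatcher
{
    // Walks the already-ordered sequence once and starts a new batch
    // whenever the key changes or the current batch is full.
    public static List<List<T>> Batch<T, TKey>(
        IEnumerable<T> ordered, Func<T, TKey> keySelector, int batchSize)
    {
        var batches = new List<List<T>>();
        var comparer = EqualityComparer<TKey>.Default;
        List<T> current = null;
        TKey currentKey = default(TKey);

        foreach (var item in ordered)
        {
            var key = keySelector(item);
            if (current == null
                || current.Count == batchSize
                || !comparer.Equals(key, currentKey))
            {
                current = new List<T>();
                batches.Add(current);
                currentKey = key;
            }
            current.Add(item);
        }
        return batches;
    }

    static void Main()
    {
        // Hypothetical data: 50 records in group "A", then 100 in group "B".
        var records = Enumerable.Repeat("A", 50).Concat(Enumerable.Repeat("B", 100));
        var sizes = Batch(records, r => r, 100).Select(b => b.Count);
        Console.WriteLine(string.Join(", ", sizes)); // 50, 100
    }
}
```

For the records in the question, the key selector could return an anonymous type such as new { r.Location, r.RepName }, since anonymous types compare by value under the default equality comparer.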