I have a list of C# objects containing the following simplified data:
ID, Price
2, 80.0
8, 44.25
14, 43.5
30, 79.98
54, 44.24
74, 80.01
I'm trying to GroupBy on the lowest number while taking a tolerance factor into account. For example, with tolerance = 0.02, my expected result would be:
44.24 -> 8, 54
43.5 -> 14
79.98 -> 2, 30, 74
How can I do this while still getting good performance on large data sets? Is LINQ the way to go in this case?
Answer 0 (score: 4)
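It seems to me that if you have a large data set, you'll want to avoid the straightforward solution of sorting the values and then collecting them as you iterate over the sorted list, because sorting a large collection can be expensive. The most efficient solution I could think of that doesn't do any explicit sorting is to build a tree in which each node holds the items whose keys fall within a "contiguous" range (every key is within tolerance of a neighboring key). A node's range expands whenever an item is added that falls outside the range by less than tolerance. I implemented a solution, which turned out to be more complicated and more interesting than I expected, and based on my rough benchmarking it looks like doing it this way takes about half the time of the straightforward solution.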
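Here is my implementation as an extension method (so you can chain it, although, like the normal Group method, it iterates source completely as soon as the resulting IEnumerable is iterated).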
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

// Extension methods must be declared in a static class; this class name is arbitrary.
public static class GroupWithToleranceExtensions
{
    public static IEnumerable<IGrouping<double, TValue>> GroupWithTolerance<TValue>(
        this IEnumerable<TValue> source,
        double tolerance,
        Func<TValue, double> keySelector)
    {
        if (source == null)
            throw new ArgumentNullException(nameof(source));
        return GroupWithToleranceHelper<TValue>.Group(source, tolerance, keySelector);
    }

    private static class GroupWithToleranceHelper<TValue>
    {
        public static IEnumerable<IGrouping<double, TValue>> Group(
            IEnumerable<TValue> source,
            double tolerance,
            Func<TValue, double> keySelector)
        {
            Node root = null, current = null;
            foreach (var item in source)
            {
                var key = keySelector(item);
                if (root == null) root = new Node(key);
                current = root;
                while (true)
                {
                    if (key < current.Min - tolerance)
                    {
                        // Key is below this node's range: descend left, creating the node if needed.
                        current = current.Left ?? (current.Left = new Node(key));
                    }
                    else if (key > current.Max + tolerance)
                    {
                        // Key is above this node's range: descend right, creating the node if needed.
                        current = current.Right ?? (current.Right = new Node(key));
                    }
                    else
                    {
                        // Key is within tolerance of this node's range: store the item here
                        // and widen the range if the key extends it.
                        current.Values.Add(item);
                        if (current.Max < key)
                        {
                            current.Max = key;
                            current.Redistribute(tolerance);
                        }
                        if (current.Min > key)
                        {
                            current.Min = key;
                            current.Redistribute(tolerance);
                        }
                        break;
                    }
                }
            }

            // An empty source produces no groups.
            if (root == null)
                yield break;

            foreach (var entry in InOrder(root))
            {
                yield return entry;
            }
        }

        // In-order traversal, so groups come out in ascending key order.
        private static IEnumerable<IGrouping<double, TValue>> InOrder(Node node)
        {
            if (node.Left != null)
                foreach (var element in InOrder(node.Left))
                    yield return element;
            yield return node;
            if (node.Right != null)
                foreach (var element in InOrder(node.Right))
                    yield return element;
        }

        private class Node : IGrouping<double, TValue>
        {
            public double Min;
            public double Max;
            public readonly List<TValue> Values = new List<TValue>();
            public Node Left;
            public Node Right;

            public Node(double key)
            {
                Min = key;
                Max = key;
            }

            // The group's key is the smallest key it contains.
            public double Key { get { return Min; } }

            IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }
            public IEnumerator<TValue> GetEnumerator() { return Values.GetEnumerator(); }

            public IEnumerable<TValue> GetLeftValues()
            {
                return Left == null ? Values : Values.Concat(Left.GetLeftValues());
            }

            public IEnumerable<TValue> GetRightValues()
            {
                return Right == null ? Values : Values.Concat(Right.GetRightValues());
            }

            // After this node's range has grown, absorb any child whose range
            // now lies within tolerance of it.
            public void Redistribute(double tolerance)
            {
                if (this.Left != null)
                {
                    this.Left.Redistribute(tolerance);
                    if (this.Left.Max + tolerance > this.Min)
                    {
                        this.Values.AddRange(this.Left.GetRightValues());
                        this.Min = this.Left.Min;
                        this.Left = this.Left.Left;
                    }
                }
                if (this.Right != null)
                {
                    this.Right.Redistribute(tolerance);
                    if (this.Right.Min - tolerance < this.Max)
                    {
                        this.Values.AddRange(this.Right.GetLeftValues());
                        this.Max = this.Right.Max;
                        this.Right = this.Right.Right;
                    }
                }
            }
        }
    }
}
If needed, you can switch double to another type (I wish C# had a numeric generic constraint).
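Since .NET 7, the System.Numerics.INumber<TSelf> interface can serve as that kind of constraint. The following is a minimal sketch, assuming .NET 7 or later. The class and method names are hypothetical, and it uses the simpler sort-then-collect approach rather than the tree above, purely to keep the example short.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Numerics;

// Hypothetical helper, not part of the answer above: groups by any numeric key type.
public static class GenericToleranceGrouping
{
    public static List<(TKey Key, List<TValue> Items)> GroupSortedWithTolerance<TValue, TKey>(
        IEnumerable<TValue> source,
        TKey tolerance,
        Func<TValue, TKey> keySelector)
        where TKey : INumber<TKey>
    {
        var groups = new List<(TKey Key, List<TValue> Items)>();
        var hasPrevious = false;
        var previousKey = TKey.Zero;

        // Sort by key, then start a new group whenever the gap to the
        // previous key exceeds the tolerance.
        foreach (var entry in source
            .Select(item => (Key: keySelector(item), Item: item))
            .OrderBy(e => e.Key))
        {
            if (!hasPrevious || entry.Key - previousKey > tolerance)
                groups.Add((entry.Key, new List<TValue>()));

            groups[groups.Count - 1].Items.Add(entry.Item);
            previousKey = entry.Key;
            hasPrevious = true;
        }
        return groups;
    }
}
```

Called with decimal keys, for example GenericToleranceGrouping.GroupSortedWithTolerance(items, 0.02m, i => i.Price), this avoids binary floating-point rounding at the tolerance boundary; each group's Key is the smallest key in that group.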
Answer 1 (score: 0)
Here is an implementation of the simpler sort-and-collect approach that Steve avoided.
using System;
using System.Collections.Generic;
using System.Linq;

public static class EnumerableExtensions
{
    public static IEnumerable<IGrouping<double, T>> GroupByWithTolerance<T>(
        this IEnumerable<T> source,
        Func<T, double> keySelector,
        double tolerance)
    {
        // Materialize the sorted sequence once. OrderBy is deferred, so calling
        // Any(), First() and Skip(1) on the bare query would sort the source three times.
        var orderedSource = source
            .Select(e => new { Key = keySelector(e), Value = e })
            .OrderBy(e => e.Key)
            .ToList();

        if (!orderedSource.Any())
            yield break;

        var prev = orderedSource.First();
        var itemGroup = new Group<double, T>(prev.Key) { prev.Value };
        foreach (var current in orderedSource.Skip(1))
        {
            // Stay in the current group while each key is within tolerance of the previous one.
            if (current.Key - prev.Key <= tolerance)
            {
                itemGroup.Add(current.Value);
            }
            else
            {
                yield return itemGroup;
                itemGroup = new Group<double, T>(current.Key) { current.Value };
            }
            prev = current;
        }
        yield return itemGroup;
    }

    // A list of items that also exposes the group's key (the smallest key in the group).
    private class Group<TKey, TSource> : List<TSource>, IGrouping<TKey, TSource>
    {
        public Group(TKey key)
        {
            Key = key;
        }

        public TKey Key { get; }
    }
}
Edit:
Sample usage:
[Test]
public void Test()
{
    var items = new[]
    {
        new Item {Id = 2, Price = 80.0},
        new Item {Id = 8, Price = 44.25},
        new Item {Id = 14, Price = 43.5},
        new Item {Id = 30, Price = 79.98},
        new Item {Id = 54, Price = 44.24},
        new Item {Id = 74, Price = 80.01}
    };

    var groups = items.GroupByWithTolerance(i => i.Price, 0.02);
    foreach (var itemGroup in groups)
    {
        var groupString = string.Join(", ", itemGroup.Select(i => i.ToString()));
        System.Console.WriteLine($"{itemGroup.Key} -> {groupString}");
    }
}

private class Item
{
    public int Id { get; set; }
    public double Price { get; set; }
    public override string ToString() => $"[ID: {Id}, Price: {Price}]";
}
Output:
43.5 -> [ID: 14, Price: 43.5]
44.24 -> [ID: 54, Price: 44.24], [ID: 8, Price: 44.25]
79.98 -> [ID: 30, Price: 79.98], [ID: 2, Price: 80], [ID: 74, Price: 80.01]
Answer 2 (score: 0)
The most straightforward approach is to write your own IEqualityComparer<double>.
using System;
using System.Collections.Generic;

public class ToleranceEqualityComparer : IEqualityComparer<double>
{
    public double Tolerance { get; set; } = 0.02;

    public bool Equals(double x, double y)
    {
        // Symmetric check: Equals(x, y) and Equals(y, x) always agree.
        return Math.Abs(x - y) <= Tolerance;
    }

    // Return a constant hash code so that GroupBy always falls back to Equals.
    public int GetHashCode(double obj) => 1;
}
Use it like this:
var dataByPrice = data.GroupBy(d => d.Price, new ToleranceEqualityComparer());
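One thing worth checking with a comparer like this is how it groups chains of nearby values. GroupBy compares each new key against the first key of every existing group, and tolerance-based equality is not transitive, so the grouping can depend on input order. The sketch below illustrates this; the class names are hypothetical and the comparer is a stand-in written for this example, not the one above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in comparer for this illustration.
public sealed class AbsToleranceComparer : IEqualityComparer<double>
{
    private readonly double _tolerance;

    public AbsToleranceComparer(double tolerance)
    {
        _tolerance = tolerance;
    }

    public bool Equals(double x, double y) => Math.Abs(x - y) <= _tolerance;

    // Constant hash code, so every key is compared with Equals.
    public int GetHashCode(double obj) => 1;
}

public static class ComparerGroupingDemo
{
    // Counts the groups that GroupBy produces for the given input order.
    public static int CountGroups(IEnumerable<double> prices, double tolerance) =>
        prices.GroupBy(p => p, new AbsToleranceComparer(tolerance)).Count();
}
```

With a tolerance of 0.02, the sequence 1.0, 1.015, 1.03 yields two groups (1.03 is more than 0.02 away from the first group's key, 1.0), while 1.015, 1.0, 1.03 yields one. The group key is the first value encountered, not necessarily the minimum, and the constant hash code means every key is compared against existing groups one by one, which may be slow on large inputs.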