Need to choose the appropriate collection class for a specific task

Time: 2013-12-09 23:59:07

Tags: c# .net

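I am looking for a design solution to the following problem:

I have a large collection of items that need to be compared with other collections of items to find their intersection and difference (except) sets. At the same time, the internal state of an item may change at runtime, but this state does not affect the item's identity.

I would use something like HashSet<T> to run Except/Intersect operations and to add items quickly, but then I can't update an item's state, because the set has no operation for getting an element out of it.

I would use Dictionary<string, T> to add items quickly and to access them quickly in order to change their state, but no set comparison operations are provided for IDictionary.

How would you solve this problem, keeping performance considerations in mind?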

3 answers:

Answer 0: (score: 2)

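As I pointed out in the comment above, the fact that a given value always has the same key means that every IDictionary<string, T> holding it will have the same KeyValuePair<string, T> for it, so you can use the extension methods.

More than that, one can also take advantage of the fact that the guarantee of a fixed key for each item means you can do the set operations based on the keys alone. This lets you quickly replicate the ISet<T> methods with something like the following: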

using System.Collections.Generic;

//Null-checks omitted for brevity:
public static class DictionaryAsSet
{
  //Note that some, but not all, of these methods allow one to use two dictionaries
  //with different types of value, as long as they've the same type of key.
  //They also assume that the same `IEqualityComparer<TKey>` is used, and will be
  //weird in results otherwise.
  public static void ExceptWithByKey<TKey, TValue1, TValue2>(this IDictionary<TKey, TValue1> dictionary, IDictionary<TKey, TValue2> other)
  {
    if(dictionary.Count != 0)
    {
      if(dictionary == (object)other)
        dictionary.Clear();
      else
        foreach(TKey key in other.Keys)
          dictionary.Remove(key);
    }
  }
  public static void IntersectWithByKey<TKey, TValue1, TValue2>(this IDictionary<TKey, TValue1> dictionary, IDictionary<TKey, TValue2> other)
  {
    if(dictionary.Count != 0 && dictionary != (object)other )
    {
      //Collect the keys that are in dictionary but not in other; an intersection
      //must drop exactly those. (We can't remove while enumerating dictionary.)
      List<TKey> toRemove = new List<TKey>();
      foreach(TKey key in dictionary.Keys)
        if(!other.ContainsKey(key))
          toRemove.Add(key);
      if(toRemove.Count == dictionary.Count)
        dictionary.Clear();
      else
        foreach(TKey key in toRemove)
          dictionary.Remove(key);
    }
  }
  public static bool IsSubsetOfByKey<TKey, TValue1, TValue2>(this IDictionary<TKey, TValue1> dictionary, IDictionary<TKey, TValue2> other)
  {
    if(dictionary.Count == 0 || dictionary == (object)other)
      return true;
    if(dictionary.Count > other.Count)
      return false;
    foreach(TKey key in dictionary.Keys)
      if(!other.ContainsKey(key))
        return false;
    return true;
  }
  public static bool IsProperSubsetOfByKey<TKey, TValue1, TValue2>(this IDictionary<TKey, TValue1> dictionary, IDictionary<TKey, TValue2> other)
  {
    return dictionary.Count < other.Count && dictionary.IsSubsetOfByKey(other);
  }
  public static bool IsSupersetOfByKey<TKey, TValue1, TValue2>(this IDictionary<TKey, TValue1> dictionary, IDictionary<TKey, TValue2> other)
  {
    return other.IsSubsetOfByKey(dictionary);
  }
  public static bool IsProperSupersetOfByKey<TKey, TValue1, TValue2>(this IDictionary<TKey, TValue1> dictionary, IDictionary<TKey, TValue2> other)
  {
    return other.IsProperSubsetOfByKey(dictionary);
  }
  public static bool OverlapsByKey<TKey, TValue1, TValue2>(this IDictionary<TKey, TValue1> dictionary, IDictionary<TKey, TValue2> other)
  {
    //An empty set shares no element with anything, so it never overlaps.
    if(dictionary.Count == 0 || other.Count == 0)
      return false;
    if(dictionary == (object)other)
      return true;
    foreach(TKey key in dictionary.Keys)
      if(other.ContainsKey(key))
        return true;
    return false;
  }
  public static bool SetEqualsByKey<TKey, TValue1, TValue2>(this IDictionary<TKey, TValue1> dictionary, IDictionary<TKey, TValue2> other)
  {
    if(dictionary == (object)other)
      return true;
    if(dictionary.Count != other.Count)
      return false;
    foreach(TKey key in dictionary.Keys)
      if(!other.ContainsKey(key))
        return false;
    return true;
  }
  public static void SymmetricExceptWithByKey<TKey, TValue>(this IDictionary<TKey, TValue> dictionary, IDictionary<TKey, TValue> other)
  {
    if(dictionary.Count == 0)
      dictionary.UnionWithByKey(other);
    else if(dictionary == other)
      dictionary.Clear();
    else
    {
      List<TKey> toRemove = new List<TKey>();
      List<KeyValuePair<TKey, TValue>> toAdd = new List<KeyValuePair<TKey, TValue>>();
      foreach(var kvp in other)
        if(dictionary.ContainsKey(kvp.Key))
          toRemove.Add(kvp.Key);
        else
          toAdd.Add(kvp);
      foreach(TKey key in toRemove)
        dictionary.Remove(key);
      foreach(var kvp in toAdd)
        dictionary.Add(kvp.Key, kvp.Value);
    }
  }
  public static void UnionWithByKey<TKey, TValue>(this IDictionary<TKey, TValue> dictionary, IDictionary<TKey, TValue> other)
  {
    foreach(var kvp in other)
      if(!dictionary.ContainsKey(kvp.Key))
        dictionary.Add(kvp.Key, kvp.Value);
  }
}

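Most of these should be comparable in efficiency to HashSet<T>, though there are a few optimisations that HashSet<T> can make by accessing its own internal structure and that we cannot.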
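To tie this back to the question, here is a minimal sketch of how such key-based methods might be used. The `Item` type, its `State` property and the `KeySetDemo` class are made up for illustration, and `ExceptWithByKey` is repeated in cut-down form only so that the snippet compiles on its own.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical item type: Id is the identity, State is mutable extra data.
public sealed class Item
{
    public Item(string id, string state) { Id = id; State = state; }
    public string Id { get; private set; }
    public string State { get; set; }
}

public static class KeySetDemo
{
    // Cut-down copy of ExceptWithByKey so this sketch is self-contained.
    public static void ExceptWithByKey<TKey, TValue1, TValue2>(
        this IDictionary<TKey, TValue1> dictionary, IDictionary<TKey, TValue2> other)
    {
        if (ReferenceEquals(dictionary, other))
        {
            dictionary.Clear();
            return;
        }
        foreach (TKey key in other.Keys)
            dictionary.Remove(key);
    }

    public static Dictionary<string, Item> Run()
    {
        var current = new Dictionary<string, Item>
        {
            { "a", new Item("a", "new") },
            { "b", new Item("b", "new") },
            { "c", new Item("c", "new") }
        };
        var processed = new Dictionary<string, Item>
        {
            { "b", new Item("b", "done") }
        };

        // Look the item up by its identity and change its state in place.
        current["a"].State = "updated";

        // Set difference by key only; the State values are never compared.
        current.ExceptWithByKey(processed);
        return current; // only the keys "a" and "c" are left
    }
}
```

The point is that the dictionary gives both things the question asks for: lookup by identity to change state, and set-style operations that ignore that state.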

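Alternatively, if you prefer the way the System.Linq.Enumerable extension methods work, you can create optimised versions of them for this particular scenario. E.g.: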

public static class DictionaryAsSetEnumerable
{
  //we could also return IEnumerable<KeyValuePair<TKey, TValue1>> if we wanted
  public static IEnumerable<TValue1> Except<TKey, TValue1, TValue2>(this IDictionary<TKey, TValue1> dictionary, IDictionary<TKey, TValue2> other)
  {
    if(dictionary.Count != 0 && dictionary != (object)other)
    {
       foreach(var kvp in dictionary)
         if(!other.ContainsKey(kvp.Key))
           yield return kvp.Value;
    }
  }
  //And so on. The approach for each here should be clear from those above 
}

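Comparing this with the implementation of Enumerable.Except() should show that this version is faster, because it can make assumptions that Enumerable.Except cannot.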
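As a rough illustration of that difference (a sketch, not a benchmark): the general-purpose LINQ call only sees two sequences, so it typically has to build its own set from the second one, while the key-based version can probe the second dictionary directly. The method is renamed `ExceptValuesByKey` here, a made-up name, to keep it visibly apart from `Enumerable.Except`.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class ExceptComparison
{
    // Lazy key-based difference, same idea as the Except method shown above.
    public static IEnumerable<TValue1> ExceptValuesByKey<TKey, TValue1, TValue2>(
        this IDictionary<TKey, TValue1> dictionary, IDictionary<TKey, TValue2> other)
    {
        foreach (KeyValuePair<TKey, TValue1> kvp in dictionary)
            if (!other.ContainsKey(kvp.Key))
                yield return kvp.Value;
    }

    public static bool SameResult()
    {
        var a = new Dictionary<string, int> { { "x", 1 }, { "y", 2 }, { "z", 3 } };
        var b = new Dictionary<string, string> { { "y", "ignored" } };

        // General LINQ route: difference of the key sequences, then look values up again.
        List<int> viaLinq = a.Keys.Except(b.Keys).Select(k => a[k]).OrderBy(v => v).ToList();

        // Key-based route: one pass over a, probing b's existing hash table.
        List<int> viaKeys = a.ExceptValuesByKey(b).OrderBy(v => v).ToList();

        return viaLinq.SequenceEqual(viaKeys);
    }
}
```

Both routes give the same values; the second one simply does less work to get there because it knows both inputs are dictionaries.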

A final approach is a composite collection object. Here we create a class to represent each operation. E.g.:

using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

public static class DictionarySetExtensions
{
  public static IDictionary<TKey, TValue> ExceptByKey<TKey, TValue>(this IDictionary<TKey, TValue> dictionary, IDictionary<TKey, TValue> other)
  {
    return new ExceptDictionary<TKey, TValue>(dictionary, other);
  }
  private class ExceptDictionary<TKey, TValue> : IDictionary<TKey, TValue>
  {
    private readonly IDictionary<TKey, TValue> _source;
    private readonly IDictionary<TKey, TValue> _exclude;
    public ExceptDictionary(IDictionary<TKey, TValue> source, IDictionary<TKey, TValue> exclude)
    {
      _source = source;
      _exclude = exclude;
    }
    public TValue this[TKey key]
    {
      get
      {
        if(_exclude.ContainsKey(key))
          throw new KeyNotFoundException();
        return _source[key];
      }
      //A non-readonly version is possible, but probably ill-advised. This sort of
      //approach creates surprises if you don't use immutable results.
      set { throw new InvalidOperationException("Read Only Dictionary"); }
    }
    ICollection<TKey> IDictionary<TKey, TValue>.Keys
    {
      get
      {
        //there are more efficient approaches by creating a wrapper
        //class on this again, but this shows the principle.
        return this.Select(kvp => kvp.Key).ToList();
      }
    }
    ICollection<TValue> IDictionary<TKey, TValue>.Values
    {
      get
      {
        return this.Select(kvp => kvp.Value).ToList();
      }
    }
    //Note that Count is O(n), not O(1) as usual with collections.
    public int Count
    {
      get
      {
        int tally = 0;
        using(var en = GetEnumerator())
          while(en.MoveNext())
            ++tally;
        return tally;
      }
    }
    bool ICollection<KeyValuePair<TKey, TValue>>.IsReadOnly
    {
      get { return true; }
    }
    public bool ContainsKey(TKey key)
    {
      return _source.ContainsKey(key) && !_exclude.ContainsKey(key);
    }
    void IDictionary<TKey, TValue>.Add(TKey key, TValue value)
    {
      throw new InvalidOperationException("Read only");
    }
    bool IDictionary<TKey, TValue>.Remove(TKey key)
    {
      throw new InvalidOperationException("Read only");
    }
    public bool TryGetValue(TKey key, out TValue value)
    {
      if(_exclude.ContainsKey(key))
      {
        value = default(TValue);
        return false;
      }
      return _source.TryGetValue(key, out value);
    }
    void ICollection<KeyValuePair<TKey, TValue>>.Add(KeyValuePair<TKey, TValue> item)
    {
      throw new InvalidOperationException("Read only");
    }
    void ICollection<KeyValuePair<TKey, TValue>>.Clear()
    {
      throw new InvalidOperationException("Read only");
    }
    public bool Contains(KeyValuePair<TKey, TValue> item)
    {
      TValue cmp;
      return TryGetValue(item.Key, out cmp) && Equals(cmp, item.Value);
    }
    public void CopyTo(KeyValuePair<TKey, TValue>[] array, int arrayIndex)
    {
      //Way lazy here for demonstration sake. This is the sort of use of ToList() I hate, but you'll get the idea.
      this.ToList().CopyTo(array, arrayIndex);
    }
    bool ICollection<KeyValuePair<TKey, TValue>>.Remove(KeyValuePair<TKey, TValue> item)
    {
      throw new InvalidOperationException("Read only");
    }
    public IEnumerator<KeyValuePair<TKey, TValue>> GetEnumerator()
    {
      foreach(var kvp in _source)
        if(!_exclude.ContainsKey(kvp.Key))
          yield return kvp;
    }
    IEnumerator IEnumerable.GetEnumerator()
    {
      return GetEnumerator();
    }
  }
}

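With this approach, calling ExceptByKey returns a new object that behaves as if it contained the result of the set-operation except. Calling UnionByKey would return an instance of another class that takes the same approach, and so on. Of course, you have to create a new class for each such method, but this can be pretty quick if you start from an abstract base: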

internal abstract class ReadOnlyDictionaryBase<TKey, TValue> : IDictionary<TKey, TValue>
{
  public TValue this[TKey key]
  {
    get
    {
      TValue value;
      if(!TryGetValue(key, out value))
        throw new KeyNotFoundException();
      return value;
    }
  }
  TValue IDictionary<TKey, TValue>.this[TKey key]
  {
    get { return this[key]; }
    set { throw new InvalidOperationException("Read only"); }
  }
  public ICollection<TKey> Keys
  {
    get { return this.Select(kvp => kvp.Key).ToList(); }
  }
  public ICollection<TValue> Values
  {
    get { return this.Select(kvp => kvp.Value).ToList(); }
  }
  public int Count
  {
    get
    {
      int tally = 0;
      using(var en = GetEnumerator())
        while(en.MoveNext())
          ++tally;
      return tally;
    }
  }
  bool ICollection<KeyValuePair<TKey, TValue>>.IsReadOnly
  {
    get { return true; }
  }
  public bool ContainsKey(TKey key)
  {
    TValue unused;
    return TryGetValue(key, out unused);
  }
  void IDictionary<TKey, TValue>.Add(TKey key, TValue value)
  {
    throw new NotSupportedException("Read only");
  }
  bool IDictionary<TKey, TValue>.Remove(TKey key)
  {
    throw new NotSupportedException("Read only");
  }
  public abstract bool TryGetValue(TKey key, out TValue value);
  void ICollection<KeyValuePair<TKey, TValue>>.Add(KeyValuePair<TKey, TValue> item)
  {
    throw new NotSupportedException("Read only");
  }
  void ICollection<KeyValuePair<TKey, TValue>>.Clear()
  {
    throw new NotSupportedException("Read only");
  }
  public bool Contains(KeyValuePair<TKey, TValue> item)
  {
    TValue value;
    return TryGetValue(item.Key, out value) && Equals(value, item.Value);
  }
  public void CopyTo(KeyValuePair<TKey, TValue>[] array, int arrayIndex)
  {
    this.ToList().CopyTo(array, arrayIndex);
  }
  bool ICollection<KeyValuePair<TKey, TValue>>.Remove(KeyValuePair<TKey, TValue> item)
  {
    throw new NotSupportedException("Read only");
  }
  public abstract IEnumerator<KeyValuePair<TKey, TValue>> GetEnumerator();
  IEnumerator IEnumerable.GetEnumerator()
  {
    return GetEnumerator();
  }
}

Then you only need to implement TryGetValue() and GetEnumerator() to implement a class, e.g.:

internal class  UnionDictionary<TKey, TValue> : ReadOnlyDictionaryBase<TKey, TValue>
{
  private readonly IDictionary<TKey, TValue> _first;
  private readonly IDictionary<TKey, TValue> _second;
  public UnionDictionary(IDictionary<TKey, TValue> first, IDictionary<TKey, TValue> second)
  {
    _first = first;
    _second = second;
  }
  public override bool TryGetValue(TKey key, out TValue value)
  {
    return _first.TryGetValue(key, out value) || _second.TryGetValue(key, out value);
  }
  public override IEnumerator<KeyValuePair<TKey, TValue>> GetEnumerator()
  {
    foreach(var kvp in _first)
      yield return kvp;
    foreach(var kvp in _second)
      if(!_first.ContainsKey(kvp.Key))
        yield return kvp;
  }
}

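You may, though, want to make some members virtual and then override them with optimised versions. For example, with the UnionDictionary above we could benefit from: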

public override int Count
{
  get
  {
    int tally = _first.Count;//O(1) if _first has an O(1) Count
    foreach(var kvp in _second)
      if(!_first.ContainsKey(kvp.Key))
        ++tally;
    return tally;
  }
}

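Interestingly, the relative efficiency of different tasks is quite different from the other approaches: the result is returned in O(1) time rather than the O(n) or O(n + m) of the other cases. Most calls on the object are also O(1), though still slower than calls on the original dictionaries, while Count has gone from O(1) to O(n).

It is also worth noting that these objects become less efficient the more source objects are involved in them. So if we take some small dictionaries and perform a large number of set-based operations on them, this approach quickly becomes slow, because calls to the methods end up with more and more work to do. On the other hand, if we have very large dictionaries and perform a few set operations on them, this approach can be much faster, because we do very little copying, allocating and iterating through sequences.

This approach also has an interesting advantage and an interesting disadvantage.

The interesting advantage is that it can give great thread safety. Since all of these operations produce immutable objects from arguments that are not mutated either, you can have hundreds of threads working on shared dictionaries without any risk from mutation. Of course, someone mutating a source Dictionary will ruin things for all of those threads, but that can be avoided either by simply not mutating them after creation, or by enforcing it: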

public ExceptDictionary(IDictionary<TKey, TValue> source, IDictionary<TKey, TValue> exclude, IEqualityComparer<TKey> comparer)
{
  _source = source.IsReadOnly ? source : source.ToDictionary(kvp => kvp.Key, kvp => kvp.Value, comparer);
  _exclude = exclude.IsReadOnly ? exclude : exclude.ToDictionary(kvp => kvp.Key, kvp => kvp.Value, comparer);
}

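Sadly, this only works if we know which comparer we are dealing with. It has the further advantage that, if we know the source dictionaries will not be mutated, we can memoise the more expensive calls, so that e.g. Count only needs to be O(n) the first time and subsequent calls can be O(1).

(Conversely, while not thread-safe, the opposite can be useful too: change some source dictionaries as application state changes, and the objects representing set operations update automatically.)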
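A minimal sketch of memoising Count, assuming the source dictionaries really are never mutated after construction. `UnionCount` is a made-up, cut-down stand-in for UnionDictionary that only shows the Count part; `Lazy<T>` is used because its default mode is thread-safe, which fits the shared-read scenario described above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, cut-down union view: only Count is shown.
public sealed class UnionCount<TKey, TValue>
{
    private readonly IDictionary<TKey, TValue> _first;
    private readonly IDictionary<TKey, TValue> _second;
    private readonly Lazy<int> _count;

    public UnionCount(IDictionary<TKey, TValue> first, IDictionary<TKey, TValue> second)
    {
        _first = first;
        _second = second;
        // Computed on first access only; later reads return the cached value.
        _count = new Lazy<int>(ComputeCount);
    }

    // O(n) on the first call, O(1) afterwards.
    public int Count { get { return _count.Value; } }

    private int ComputeCount()
    {
        int tally = _first.Count;
        foreach (TKey key in _second.Keys)
            if (!_first.ContainsKey(key))
                ++tally;
        return tally;
    }
}
```

If a source dictionary is mutated after the first read, the cached value goes stale, so this only makes sense together with the immutability guarantee.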

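The interesting disadvantage is just how bad this can be for garbage collection. This sort of general approach is often very good for garbage collection, because the same collections may be reused in several places. This is not an example of that, though: we can have objects in memory that purely represent a key having no matching value, or being duplicated across the two sources of a union, and so on, so after a large number of operations you can be using gigabytes of memory for structures that semantically contain only a handful of elements. You can get around this by periodically dumping the contents into a Dictionary and letting the waste be collected. How often one should do so is a balance: too often misses the whole point of the approach, while too seldom leaves a lot of waste.

One way to do this is to add an internally visible Depth field to ReadOnlyDictionaryBase, which we set at construction: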

public static IDictionary<TKey, TValue> UnionByKey<TKey, TValue>(this IDictionary<TKey, TValue> first, IDictionary<TKey, TValue> second)
{
  var firstRO = first as ReadOnlyDictionaryBase<TKey, TValue>;
  var secondRO = second as ReadOnlyDictionaryBase<TKey, TValue>;
  int depth = (firstRO == null ? 1 : firstRO.Depth) + (secondRO == null ? 1 : secondRO.Depth);
  var result = new UnionDictionary<TKey, TValue>(first, second, depth);
  return depth > MAX_DEPTH ? result.DumpToDictionary() : result;
}
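The DumpToDictionary call above is not defined anywhere in this answer. A minimal sketch of what such a helper might look like follows; the signature, the optional comparer parameter and the `DumpExtensions` class name are all assumptions.

```csharp
using System.Collections.Generic;

public static class DumpExtensions
{
    // Hypothetical helper: copies a (possibly deeply nested) read-only view into
    // a plain Dictionary, so that the chain of wrapper objects can be collected.
    // Passing null for the comparer makes Dictionary use the default comparer.
    public static IDictionary<TKey, TValue> DumpToDictionary<TKey, TValue>(
        this IDictionary<TKey, TValue> view, IEqualityComparer<TKey> comparer = null)
    {
        var copy = new Dictionary<TKey, TValue>(comparer);
        foreach (KeyValuePair<TKey, TValue> kvp in view)
            copy.Add(kvp.Key, kvp.Value);
        return copy;
    }
}
```

The copy costs O(n) once, after which lookups no longer walk through every wrapper in the chain.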

Answer 1: (score: 1)

  

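I have a large collection of items that need to be compared with other collections of items to find their intersection and difference (except) sets. At the same time, the internal state of an item may change at runtime, but this state does not affect the item's identity.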

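While technically you can mutate objects that are stored in a Dictionary or a HashSet, and it will be fine as long as none of the changed internal data is used in your object's GetHashCode and Equals methods, this seems like a very strange way of doing things. I would discourage you from doing it, and suggest splitting your object up instead.

Why? A few years ago I built some framework-type code in which object equality was based on some rather than all of the object's fields (which sounds similar to what you describe: some properties made up the ID, while the others were just extra data), and it has caused a lot of bugs ever since, because other developers keep being surprised and confused by it. What I learned from this is that C# developers basically expect objects to have either:

  • Reference equality only
  • "Deep" equality based on all fields.

Because this is neither plain reference equality nor full deep equality, people will change an "extra" field and then wonder why their 2 objects are still equal even though the extra fields differ.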

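Suggestions on how to split it up

Put the key parts into an immutable class or struct, and have a second class that holds the mutable data. You should then be able to put all the key parts into a Dictionary and update the mutable data without causing problems (or confusion).

You will have to write your own Except / Intersect methods, but that should not be too hard.

As an example, instead of: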

public class Item {
    readonly int key1;
    readonly  string key2;

    string extra1;
    DateTime extra2;

    public override bool Equals(object other) {
        var otherItem = other as Item;
        if(otherItem == null)
            return false;

        return key1 == otherItem.key1 && key2 == otherItem.key2;
    } // and equivalent GetHashCode which only checks key1 and key2
}

var data = new HashSet<Item>(); ...

you could have something like this:

public class ItemKey {
    readonly int key1;
    readonly string key2;

    // implement equals, gethashcode, etc
}

public class ItemData {
    string extra1;
    DateTime extra2;

    // don't implement equals, just rely on reference equality here
}

var data = new Dictionary<ItemKey, ItemData>() ...

You can then do hash-set operations such as Intersect on the keys alone, and carry the ItemData along as you do so.
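A minimal sketch of such a key-only intersect, under a few assumptions: the constructor, the `Extra1` property, the hash-combining formula and the `KeyedIntersect.IntersectByKey` helper are all made up for illustration and are not part of the classes above.

```csharp
using System;
using System.Collections.Generic;

// Immutable key: equality and hash code use both key fields.
public sealed class ItemKey : IEquatable<ItemKey>
{
    private readonly int _key1;
    private readonly string _key2;
    public ItemKey(int key1, string key2) { _key1 = key1; _key2 = key2; }

    public bool Equals(ItemKey other)
    {
        return !ReferenceEquals(other, null) && _key1 == other._key1 && _key2 == other._key2;
    }
    public override bool Equals(object obj) { return Equals(obj as ItemKey); }
    public override int GetHashCode()
    {
        unchecked { return (_key1 * 397) ^ (_key2 == null ? 0 : _key2.GetHashCode()); }
    }
}

// Mutable payload: no Equals override, so reference equality only.
public sealed class ItemData
{
    public string Extra1 { get; set; }
}

public static class KeyedIntersect
{
    // Keeps the entries of 'left' whose key also appears in 'right';
    // the ItemData from 'left' is carried along untouched.
    public static Dictionary<ItemKey, ItemData> IntersectByKey(
        Dictionary<ItemKey, ItemData> left, Dictionary<ItemKey, ItemData> right)
    {
        var result = new Dictionary<ItemKey, ItemData>();
        foreach (KeyValuePair<ItemKey, ItemData> kvp in left)
            if (right.ContainsKey(kvp.Key))
                result.Add(kvp.Key, kvp.Value);
        return result;
    }
}
```

Because the result holds the same ItemData instances as the left-hand dictionary, updating the mutable data through either one is visible in both.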

Answer 2: (score: 0)

I would suggest using HashSet.

Except() and Intersect() with another set.
Add() for adding a new element.
ToList() (extension method) for accessing each element in the set.
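A sketch of that suggestion, with one caveat stated up front: it relies on HashSet<T>.TryGetValue, which was added in later framework versions (.NET Framework 4.7.2 and .NET Core 2.0) and was not available when the question was asked. Where it is available, it retrieves the stored element without the O(n) copy that ToList() makes. The `Entry` type and `HashSetLookup` class are made up for illustration.

```csharp
using System.Collections.Generic;

// Hypothetical element type: equality is based on Id only, Hits is mutable state.
public sealed class Entry
{
    public Entry(string id) { Id = id; }
    public string Id { get; private set; }
    public int Hits { get; set; }

    public override bool Equals(object obj)
    {
        var other = obj as Entry;
        return other != null && Id == other.Id;
    }
    public override int GetHashCode() { return Id.GetHashCode(); }
}

public static class HashSetLookup
{
    public static int Run()
    {
        var set = new HashSet<Entry> { new Entry("a"), new Entry("b") };
        var other = new HashSet<Entry> { new Entry("b") };

        // In-place set difference: removes "b" from set.
        set.ExceptWith(other);

        // Fetch the instance actually stored in the set, then change its state.
        Entry stored;
        if (set.TryGetValue(new Entry("a"), out stored))
            stored.Hits++;

        return stored == null ? -1 : stored.Hits;
    }
}
```

On older frameworks without TryGetValue, the dictionary-based approaches in the other answers remain the way to get at the stored element.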