Is there a shortcut for checking whether .Intersect(...).Count() reaches a threshold?

Time: 2013-12-09 12:47:43

Tags: c# .net linq

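I have two collections, and I want to determine whether the number of elements they share reaches or exceeds a certain threshold.

I currently use this code (it is executed about 85 million times, so speed matters):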

public bool isSimilarTo(....)
    int numberOfSharedPoints = pointsA.Count(pointsB.Contains);
    if (numberOfSharedPoints >= THRESHOLD) return true;

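This may be inefficient, because numberOfSharedPoints has to be computed in full first.

Is there a more optimized way, for example one that iterates over the elements and uses break to stop as soon as the threshold is reached?

Bonus questions:

  1. Would this.pointsA.Intersect(pointsB).Count() as the first line be faster?
  2. The collections are currently List<>. Would HashSet be faster?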

5 answers:

Answer 0: (score: 4)

To determine whether the intersection contains at least THRESHOLD items, you can use this construct:

if (pointsA.Intersect(pointsB).Skip(THRESHOLD - 1).Any())
{
    //...
}

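As Rawling points out in a comment on another answer, Intersect fully enumerates the second sequence. The complexity of this solution therefore appears to be O(n + m), where n and m are the numbers of items in pointsA and pointsB. O(m) is the cost of building a HashSet from pointsB, which I assume is what this construct uses internally. Checking whether an element is in a hash set takes constant time (as Ilya Ivanov pointed out in the comments), and in the worst case that check is performed once per element of pointsA, so at most n times (for example, when the intersection is empty and every element has to be checked).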
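The early exit described above can be observed with a small experiment. This is only a sketch: `Counted` and `ItemsPulledFromFirst` are hypothetical helpers introduced for illustration, and the code relies on the usual LINQ to Objects behaviour in which Intersect buffers the second sequence and streams the first.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LazyIntersectDemo
{
    // Hypothetical helper: passes items through and reports each one that is pulled.
    public static IEnumerable<int> Counted(IEnumerable<int> source, Action onItem)
    {
        foreach (var item in source)
        {
            onItem();
            yield return item;
        }
    }

    // Returns how many items were pulled from the first sequence before
    // Skip(threshold - 1).Any() could answer, or -1 if the threshold was not reached.
    public static int ItemsPulledFromFirst(int size, int threshold)
    {
        int pulled = 0;
        var first = Counted(Enumerable.Range(0, size), () => pulled++);
        var second = Enumerable.Range(0, size).ToList(); // every item of first is shared

        bool reached = first.Intersect(second).Skip(threshold - 1).Any();
        return reached ? pulled : -1;
    }
}
```

Calling `LazyIntersectDemo.ItemsPulledFromFirst(1000, 10)` should report far fewer than 1000 items pulled from the first sequence, even though the second sequence is still read in full to build the internal set.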

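In addition, if you have concrete collections with a constant-time Count, you can try the following optimization, which helps when their sizes may differ significantly: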

var shorter = pointsA;
var longer = pointsB;

//makes sense if Count() is constant time
if (shorter.Count() > longer.Count())
{
    shorter = pointsB;
    longer = pointsA;
}

if (longer.Intersect(shorter).Skip(THRESHOLD - 1).Any())
{
    //...
}

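Answer 1: (score: 2)

I created a sample to measure the performance of each answer given here, including a traditional foreach loop.

In my sample console application I generate 10,000 random floating-point numbers for each of pointsA and pointsB. The threshold is 100, and I checked the performance of each method with the following code: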

static void Main(string[] args)
{
    double totalTimeSpentIntersectAndSkip = 0;
    double totalTimeSpentHashSet = 0;
    double totalTimeSpentCount = 0;
    double totalTimeSpentWhereAndSkip = 0;
    double totalTimeSpentForEach = 0;
    int maxIteration = 1000;
    for (int j = 0; j < maxIteration; j++)
    {
        Random r = new Random();
        for (int i = 0; i < 10000; i++)
        {
            pointsA.Add(r.NextDouble());
        }

        for (int i = 0; i < 10000; i++)
        {
            pointsB.Add(r.NextDouble());
        }

        s.Reset(); s.Start();
        var timeSpentInSeconds = TestUsingIntersectAndSkip();
        s.Stop();
        Console.WriteLine("IntersectAndSkip: " + timeSpentInSeconds);
        totalTimeSpentIntersectAndSkip += timeSpentInSeconds;

        s.Reset(); s.Start();
        timeSpentInSeconds = TestUsingHashSet();
        s.Stop();
        Console.WriteLine("HashSet: " + timeSpentInSeconds);
        totalTimeSpentHashSet += timeSpentInSeconds;

        s.Reset(); s.Start();
        timeSpentInSeconds = TestUsingForEach();
        s.Stop();
        Console.WriteLine("ForEach: " + timeSpentInSeconds);
        totalTimeSpentForEach += timeSpentInSeconds;

        s.Reset(); s.Start();
        timeSpentInSeconds = TestUsingWhereAndSkip();
        s.Stop();
        Console.WriteLine("WhereAndSkip: " + timeSpentInSeconds);
        totalTimeSpentWhereAndSkip += timeSpentInSeconds;

        s.Reset(); s.Start();
        timeSpentInSeconds = TestUsingCount();
        s.Stop();
        Console.WriteLine("Count: " + timeSpentInSeconds);
        totalTimeSpentCount += timeSpentInSeconds;

        Console.WriteLine("-------------------------------------------------------------------------------");
        pointsA.Clear();
        pointsB.Clear();
    }

    Console.WriteLine("Following is Average TimeSpent by each method: "+Environment.NewLine);
    Console.WriteLine("IntersectAndSkip: " + totalTimeSpentIntersectAndSkip / maxIteration);
    Console.WriteLine("HashSet: " + totalTimeSpentHashSet / maxIteration);
    Console.WriteLine("ForEach: " + totalTimeSpentForEach / maxIteration);
    Console.WriteLine("WhereAndSkip: " + totalTimeSpentWhereAndSkip / maxIteration);
    Console.WriteLine("Count: " + totalTimeSpentCount / maxIteration);
    Console.WriteLine("-------------------------------------------------------------------------------");


}
static Stopwatch s = new Stopwatch();
const int THRESHOLD = 100;
static List<Double> pointsA = new List<double>();
static List<Double> pointsB = new List<double>();

private static double TestUsingHashSet()
{
    HashSet<double> hash = new HashSet<double>(pointsA);
    hash.IntersectWith(pointsB);
    if (hash.Count >= THRESHOLD)
    {
        return s.Elapsed.TotalSeconds;
    }
    else
    {
        return s.Elapsed.TotalSeconds;
    }
}

private static double TestUsingWhereAndSkip()
{
    if (pointsA.Where(pointsB.Contains).Skip(THRESHOLD - 1).Any())
    {
        return s.Elapsed.TotalSeconds;
    }
    else
    {
        return s.Elapsed.TotalSeconds;
    }
}

private static double TestUsingCount()
{
    int numberOfSharedPoints = pointsA.Count(pointsB.Contains);
    if (numberOfSharedPoints >= THRESHOLD)
    {
        return s.Elapsed.TotalSeconds;
    }
    else
    {
        return s.Elapsed.TotalSeconds;
    }
}

private static double TestUsingForEach()
{
    var intersectItemCount = 0;
    foreach (var d in pointsA)
    {
        if (pointsB.Contains(d)) intersectItemCount++;
        if (intersectItemCount >= THRESHOLD)
        {
            return s.Elapsed.TotalSeconds;
        }
    }
    return s.Elapsed.TotalSeconds;
}

private static double TestUsingIntersectAndSkip()
{
    if (pointsA.Intersect(pointsB).Skip(THRESHOLD - 1).Any())
    {
        return s.Elapsed.TotalSeconds;
    }
    else
    {
        return s.Elapsed.TotalSeconds;
    }
}

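I ran it 1000 times and recorded the result of each iteration as well as the average. After all of that, the methods rank as follows by performance: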

1) Intersect with Skip
2) HashSet
3) Count (Given by OP)
4) Where and Skip
5) Foreach


When I changed the number of items from 10,000 to 50,000 (with 5 runs), every method except HashSet and IntersectAndSkip took too long. The performance ranking stayed almost the same.

Answer 2: (score: 0)

Try this:

if(pointsA.Where(pointsB.Contains).Skip(THRESHOLD-1).Any()){
   //...
}

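Answer 3: (score: 0)

I think you can get O(n) with sorted IEnumerables, because you only need a single pass. This version is for integers, but you could make it generic and pass in a comparer.
In my test, CompareSS below beat ss1.Intersect(ss2).Skip(thresHold1 - 1).Any() by 5:1. Yes, 5 times faster.

With two HashSets and a foreach with Contains you can also get an O(n) comparison, but creating the HashSets is more expensive.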

public static void TimeTest()
{
    int size = 20000;
    List<Int32>ss1 = new List<int>(size);
    List<Int32>ss2 = new List<int>(size);
    for(int i = 0; i < size; i++)
    {
        Int32 int1 = i;
        Int32 int2 = i + (Int32)((float)size / 2);
        ss1.Add(i);
        ss2.Add(int2);
    }
    //foreach (int iTest in ss1)
    //    System.Diagnostics.Debug.WriteLine(iTest);
    //System.Diagnostics.Debug.WriteLine("");
    //foreach (int iTest in ss2)
    //    System.Diagnostics.Debug.WriteLine(iTest);

    System.Diagnostics.Stopwatch sw = new System.Diagnostics.Stopwatch();
    sw.Start();
    int thresHold1 = (Int32)((float)size / 4);
    int thresHold2 = (Int32)((float)size * 3 / 4);

    Int32 matchcount = 0;
    for(int i = 0; i <= size; i++)
    {
        if(CompareSS(ss1, ss2, thresHold1))
            matchcount++;
        if (CompareSS(ss1, ss2, thresHold2))
            matchcount++;
    }
    System.Diagnostics.Debug.WriteLine("sw.ms {0}   count {1}", sw.ElapsedMilliseconds.ToString("N0"), matchcount.ToString("N0"));
    sw.Restart();
    matchcount = 0;
    for (int i = 0; i <= size; i++)
    {
        if (ss1.Intersect(ss2).Skip(thresHold1 - 1).Any())
            matchcount++;
        if (ss1.Intersect(ss2).Skip(thresHold2 - 1).Any())
            matchcount++;
    }
    System.Diagnostics.Debug.WriteLine("sw.ms {0}   count {1}", sw.ElapsedMilliseconds.ToString("N0"), matchcount.ToString("N0"));
    sw.Stop();

}
public static bool CompareSS (IEnumerable<Int32> ss1, IEnumerable<Int32> ss2, Int32 threshold) 
{
    //System.Diagnostics.Debug.WriteLine("threshold {0}", threshold);
    using (var cursor1 = ss1.GetEnumerator())
    using (var cursor2 = ss2.GetEnumerator())
    {
        if (!cursor1.MoveNext() || !cursor2.MoveNext())
        {
            return false;
        }
        Int32 int1 = cursor1.Current;
        Int32 int2 = cursor2.Current;               
        int count = 0;
        while (true)
        {
            //System.Diagnostics.Debug.WriteLine("int1 {0}   int2 {1}", int1, int2);
            int comparison = int1.CompareTo(int2);
            if (comparison < 0)
            {
                if (!cursor1.MoveNext())
                {
                    return false;
                }
                int1 = cursor1.Current;
            }
            else if (comparison > 0)
            {
                if (!cursor2.MoveNext())
                {
                    return false;
                }
                int2 = cursor2.Current;
            }
            else
            {
                count++;
                if (count >= threshold)
                    return true;
                if (!cursor1.MoveNext() || !cursor2.MoveNext())
                    return false;
                int1 = cursor1.Current;
                int2 = cursor2.Current;
            }
        }
    }
}
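The answer mentions that the integer version could be made generic by passing in a comparer. A minimal sketch of that idea follows; the names `SortedIntersection` and `SharesAtLeast` are hypothetical, and the method assumes both inputs are already sorted in ascending order under the same comparer.

```csharp
using System;
using System.Collections.Generic;

public static class SortedIntersection
{
    // Assumption: sortedA and sortedB are sorted ascending according to `comparer`.
    // Equal items are matched pairwise, so duplicates count once per matching pair.
    public static bool SharesAtLeast<T>(
        IEnumerable<T> sortedA,
        IEnumerable<T> sortedB,
        int threshold,
        IComparer<T> comparer = null)
    {
        if (threshold <= 0) return true; // "at least zero shared items" is always true
        comparer = comparer ?? Comparer<T>.Default;

        using (var a = sortedA.GetEnumerator())
        using (var b = sortedB.GetEnumerator())
        {
            if (!a.MoveNext() || !b.MoveNext()) return false;

            int count = 0;
            while (true)
            {
                int c = comparer.Compare(a.Current, b.Current);
                if (c < 0)
                {
                    // a is behind: advance it
                    if (!a.MoveNext()) return false;
                }
                else if (c > 0)
                {
                    // b is behind: advance it
                    if (!b.MoveNext()) return false;
                }
                else
                {
                    // shared item found: stop as soon as the threshold is reached
                    if (++count >= threshold) return true;
                    if (!a.MoveNext() || !b.MoveNext()) return false;
                }
            }
        }
    }
}
```

For example, `SortedIntersection.SharesAtLeast(new[] { 1, 3, 5, 7 }, new[] { 3, 4, 5, 8 }, 2)` finds the shared items 3 and 5 and returns true. If the inputs are not sorted, the result is meaningless, so the cost of sorting has to be counted when the data does not arrive in order.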

Answer 4: (score: -3)

If speed matters, use the following.

HashSet<T> hash = new HashSet<T>(pointsA);
hash.IntersectWith(pointsB);
return hash.Count;

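If you can use concrete collections, you should not use LINQ in performance-critical code.

Alternatively, try to obtain the elements as sets in the first place.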
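If the points are already held as sets, the early exit the question asks about can be written as a plain loop. The following is a sketch under that assumption, not part of the original answer; `SetOverlap` and `SharesAtLeast` are hypothetical names, and both sets are assumed to use the same equality comparer.

```csharp
using System.Collections.Generic;

public static class SetOverlap
{
    // Assumption: both sets were created with the same equality comparer.
    public static bool SharesAtLeast<T>(HashSet<T> a, HashSet<T> b, int threshold)
    {
        if (threshold <= 0) return true;

        // Iterate the smaller set and probe the larger one.
        var smaller = a.Count <= b.Count ? a : b;
        var larger = ReferenceEquals(smaller, a) ? b : a;

        // The overlap can never exceed the size of the smaller set.
        if (smaller.Count < threshold) return false;

        int shared = 0;
        foreach (var item in smaller)
        {
            // Stop as soon as the threshold is reached.
            if (larger.Contains(item) && ++shared >= threshold) return true;
        }
        return false;
    }
}
```

Unlike IntersectWith, this does not modify either set and does no work after the threshold is reached, which may matter when the same sets are compared many times.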