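I have two collections, and I want to determine whether the number of elements in their intersection reaches a certain threshold.
I currently use this code (it is executed about 85 million times, so speed matters):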
public bool isSimilarTo(....)
int numberOfSharedPoints = pointsA.Count(pointsB.Contains);
if (numberOfSharedPoints >= THRESHOLD) return true;
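Since numberOfSharedPoints has to be computed in full first, this is probably inefficient.
Is there a more optimized way, for example one that short-circuits the iteration with a break as soon as the threshold is reached?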
Bonus questions:
Would this.pointsA.Intersect(pointsB).Count() be faster?
Would a HashSet be faster than a List<>?
Answer 0 (score: 4)
To determine whether the intersection contains at least THRESHOLD items, you can use this construct:
if (pointsA.Intersect(pointsB).Skip(THRESHOLD - 1).Any())
{
    //...
}
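As Rawling pointed out in a comment on another answer, Intersect fully enumerates the second sequence. The complexity of this solution therefore appears to be O(n + m), where n and m are the numbers of items in pointsA and pointsB. O(m) is the cost of building a HashSet from pointsB, which I assume is the structure used internally. Checking whether an element is in a hash set takes constant time (as Ilya Ivanov pointed out in the comments), and in the worst case the check runs at most n times, once per element of pointsA (for example, when the intersection is empty and every element has to be checked).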
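To make the cost analysis above concrete, here is a hand-written sketch of roughly what the Intersect/Skip/Any chain does: build a hash set from the second sequence, then walk the first one and stop as soon as the threshold is reached. The names IntersectionCheck and SharesAtLeast are made up for illustration, and this is not the actual LINQ implementation. Like Intersect, it counts each distinct common value once.

```csharp
using System;
using System.Collections.Generic;

public static class IntersectionCheck
{
    // Hypothetical helper: true if at least `threshold` distinct values
    // occur in both sequences.
    public static bool SharesAtLeast<T>(IEnumerable<T> first, IEnumerable<T> second, int threshold)
    {
        // Zero shared items are always available.
        if (threshold <= 0) return true;

        // O(m): build the lookup set from the second sequence.
        var remaining = new HashSet<T>(second);

        int shared = 0;
        // Up to n constant-time lookups, one per element of the first sequence.
        foreach (T item in first)
        {
            // Remove returns true only the first time a common value is seen,
            // so duplicates in `first` are not counted twice.
            if (remaining.Remove(item) && ++shared >= threshold)
                return true; // early exit: the rest of `first` is never visited
        }
        return false;
    }
}
```

With the names from the question, the call would look like IntersectionCheck.SharesAtLeast(pointsA, pointsB, THRESHOLD). The O(m) set construction is still paid on every call, which is the part the early exit cannot avoid.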
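In addition, if you have concrete collections with a constant-time Count, you can try the following optimization, which helps when their sizes may differ significantly: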
var shorter = pointsA;
var longer = pointsB;
//makes sense if Count() is constant time
if (shorter.Count() > longer.Count())
{
    shorter = pointsB;
    longer = pointsA;
}
if (longer.Intersect(shorter).Skip(THRESHOLD - 1).Any())
{
    //...
}
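Answer 1 (score: 2)
I created a sample to measure the performance of every answer given here, including a traditional foreach loop.
In my sample console application I generate 10,000 random floating-point numbers for each of pointsA and pointsB.
The threshold is 100, and the performance of each method is measured with the following code: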
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

class Program
{
    static void Main(string[] args)
    {
        double totalTimeSpentIntersectAndSkip = 0;
        double totalTimeSpentHashSet = 0;
        double totalTimeSpentCount = 0;
        double totalTimeSpentWhereAndSkip = 0;
        double totalTimeSpentForEach = 0;
        int maxIteration = 1000;
        for (int j = 0; j < maxIteration; j++)
        {
            Random r = new Random();
            for (int i = 0; i < 10000; i++)
            {
                pointsA.Add(r.NextDouble());
            }
            for (int i = 0; i < 10000; i++)
            {
                pointsB.Add(r.NextDouble());
            }
            // Each test reads s.Elapsed itself, so the stopwatch is restarted before every call.
            s.Reset(); s.Start();
            var timeSpentInSeconds = TestUsingIntersectAndSkip();
            s.Stop();
            Console.WriteLine("IntersectAndSkip: " + timeSpentInSeconds);
            totalTimeSpentIntersectAndSkip += timeSpentInSeconds;
            s.Reset(); s.Start();
            timeSpentInSeconds = TestUsingHashSet();
            s.Stop();
            Console.WriteLine("HashSet: " + timeSpentInSeconds);
            totalTimeSpentHashSet += timeSpentInSeconds;
            s.Reset(); s.Start();
            timeSpentInSeconds = TestUsingForEach();
            s.Stop();
            Console.WriteLine("ForEach: " + timeSpentInSeconds);
            totalTimeSpentForEach += timeSpentInSeconds;
            s.Reset(); s.Start();
            timeSpentInSeconds = TestUsingWhereAndSkip();
            s.Stop();
            Console.WriteLine("WhereAndSkip: " + timeSpentInSeconds);
            totalTimeSpentWhereAndSkip += timeSpentInSeconds;
            s.Reset(); s.Start();
            timeSpentInSeconds = TestUsingCount();
            s.Stop();
            Console.WriteLine("Count: " + timeSpentInSeconds);
            totalTimeSpentCount += timeSpentInSeconds;
            Console.WriteLine("-------------------------------------------------------------------------------");
            pointsA.Clear();
            pointsB.Clear();
        }
        Console.WriteLine("Following is Average TimeSpent by each method: " + Environment.NewLine);
        Console.WriteLine("IntersectAndSkip: " + totalTimeSpentIntersectAndSkip / maxIteration);
        Console.WriteLine("HashSet: " + totalTimeSpentHashSet / maxIteration);
        Console.WriteLine("ForEach: " + totalTimeSpentForEach / maxIteration);
        Console.WriteLine("WhereAndSkip: " + totalTimeSpentWhereAndSkip / maxIteration);
        Console.WriteLine("Count: " + totalTimeSpentCount / maxIteration);
        Console.WriteLine("-------------------------------------------------------------------------------");
    }

    static Stopwatch s = new Stopwatch();
    const int THRESHOLD = 100;
    static List<Double> pointsA = new List<double>();
    static List<Double> pointsB = new List<double>();

    // Builds the full intersection in a HashSet, then compares its size to the threshold.
    private static double TestUsingHashSet()
    {
        HashSet<double> hash = new HashSet<double>(pointsA);
        hash.IntersectWith(pointsB);
        if (hash.Count >= THRESHOLD)
        {
            return s.Elapsed.TotalSeconds;
        }
        else
        {
            return s.Elapsed.TotalSeconds;
        }
    }

    // Linear List.Contains lookup per element, stopping at the THRESHOLD-th match.
    private static double TestUsingWhereAndSkip()
    {
        if (pointsA.Where(pointsB.Contains).Skip(THRESHOLD - 1).Any())
        {
            return s.Elapsed.TotalSeconds;
        }
        else
        {
            return s.Elapsed.TotalSeconds;
        }
    }

    // The original code from the question: counts every shared point first.
    private static double TestUsingCount()
    {
        int numberOfSharedPoints = pointsA.Count(pointsB.Contains);
        if (numberOfSharedPoints >= THRESHOLD)
        {
            return s.Elapsed.TotalSeconds;
        }
        else
        {
            return s.Elapsed.TotalSeconds;
        }
    }

    // Plain loop with an early return once the threshold is reached.
    private static double TestUsingForEach()
    {
        var intersectItemCount = 0;
        foreach (var d in pointsA)
        {
            if (pointsB.Contains(d)) intersectItemCount++;
            // >= rather than >, to match the test used by the other methods
            if (intersectItemCount >= THRESHOLD)
            {
                return s.Elapsed.TotalSeconds;
            }
        }
        return s.Elapsed.TotalSeconds;
    }

    // Hash-based Intersect, stopping at the THRESHOLD-th common element.
    private static double TestUsingIntersectAndSkip()
    {
        if (pointsA.Intersect(pointsB).Skip(THRESHOLD - 1).Any())
        {
            return s.Elapsed.TotalSeconds;
        }
        else
        {
            return s.Elapsed.TotalSeconds;
        }
    }
}
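I ran it 1000 times and recorded the result of each iteration as well as the average. Based on that analysis, the methods rank by performance as follows: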
1) Intersect with Skip
2) HashSet
3) Count (Given by OP)
4) Where and Skip
5) Foreach
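When the number of items is changed from 10,000 to 50,000 (5 runs), every method except HashSet and IntersectWithSkip takes too long. The performance ranking stays almost the same.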
Answer 2 (score: 0)
Try this:
if (pointsA.Where(pointsB.Contains).Skip(THRESHOLD - 1).Any())
{
    //...
}
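Answer 3 (score: 0)
I think you can get O(n) with sorted IEnumerables, because you only need a single pass.
This version is for Int32, but you could use generics and pass in a comparer.
In my tests, CompareSS below beat ss1.Intersect(ss2).Skip(thresHold1 - 1).Any() by 5:1.
Yes, 5 times faster.
With two HashSets and a foreach over Contains you could also do an O(n) comparison, but the HashSets are more expensive to create.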
using System;
using System.Collections.Generic;
using System.Linq;

// Wrapper class added so the two methods compile on their own; the name is arbitrary.
public static class SortedCompare
{
    public static void TimeTest()
    {
        int size = 20000;
        List<Int32> ss1 = new List<int>(size);
        List<Int32> ss2 = new List<int>(size);
        // Two sorted lists that overlap in exactly half of their elements.
        for (int i = 0; i < size; i++)
        {
            Int32 int1 = i;
            Int32 int2 = i + (Int32)((float)size / 2);
            ss1.Add(int1);
            ss2.Add(int2);
        }
        //foreach (int iTest in ss1)
        //    System.Diagnostics.Debug.WriteLine(iTest);
        //System.Diagnostics.Debug.WriteLine("");
        //foreach (int iTest in ss2)
        //    System.Diagnostics.Debug.WriteLine(iTest);
        System.Diagnostics.Stopwatch sw = new System.Diagnostics.Stopwatch();
        sw.Start();
        int thresHold1 = (Int32)((float)size / 4);
        int thresHold2 = (Int32)((float)size * 3 / 4);
        Int32 matchcount = 0;
        for (int i = 0; i <= size; i++)
        {
            if (CompareSS(ss1, ss2, thresHold1))
                matchcount++;
            if (CompareSS(ss1, ss2, thresHold2))
                matchcount++;
        }
        System.Diagnostics.Debug.WriteLine("sw.ms {0} count {1}", sw.ElapsedMilliseconds.ToString("N0"), matchcount.ToString("N0"));
        sw.Restart();
        matchcount = 0;
        for (int i = 0; i <= size; i++)
        {
            if (ss1.Intersect(ss2).Skip(thresHold1 - 1).Any())
                matchcount++;
            if (ss1.Intersect(ss2).Skip(thresHold2 - 1).Any())
                matchcount++;
        }
        System.Diagnostics.Debug.WriteLine("sw.ms {0} count {1}", sw.ElapsedMilliseconds.ToString("N0"), matchcount.ToString("N0"));
        sw.Stop();
    }

    // Single merge-style pass; both sequences must already be sorted in ascending order.
    public static bool CompareSS(IEnumerable<Int32> ss1, IEnumerable<Int32> ss2, Int32 threshold)
    {
        //System.Diagnostics.Debug.WriteLine("threshold {0}", threshold);
        using (var cursor1 = ss1.GetEnumerator())
        using (var cursor2 = ss2.GetEnumerator())
        {
            if (!cursor1.MoveNext() || !cursor2.MoveNext())
            {
                return false;
            }
            Int32 int1 = cursor1.Current;
            Int32 int2 = cursor2.Current;
            int count = 0;
            while (true)
            {
                //System.Diagnostics.Debug.WriteLine("int1 {0} int2 {1}", int1, int2);
                int comparison = int1.CompareTo(int2);
                if (comparison < 0)
                {
                    // int1 is smaller: advance the first cursor.
                    if (!cursor1.MoveNext())
                    {
                        return false;
                    }
                    int1 = cursor1.Current;
                }
                else if (comparison > 0)
                {
                    // int2 is smaller: advance the second cursor.
                    if (!cursor2.MoveNext())
                    {
                        return false;
                    }
                    int2 = cursor2.Current;
                }
                else
                {
                    // Match: count it and advance both cursors.
                    count++;
                    if (count >= threshold)
                        return true;
                    if (!cursor1.MoveNext() || !cursor2.MoveNext())
                        return false;
                    int1 = cursor1.Current;
                    int2 = cursor2.Current;
                }
            }
        }
    }
}
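The answer mentions that the same single pass could be written with generics and a comparer. A minimal sketch of that idea follows. The names SortedIntersection and SharesAtLeast are hypothetical, and the code assumes both inputs are already sorted in ascending order according to the comparer that is passed in; unsorted input gives wrong results.

```csharp
using System;
using System.Collections.Generic;

public static class SortedIntersection
{
    // Hypothetical generic variant of CompareSS.
    // Assumption: sortedA and sortedB are sorted ascending with respect to `comparer`.
    public static bool SharesAtLeast<T>(
        IEnumerable<T> sortedA,
        IEnumerable<T> sortedB,
        int threshold,
        IComparer<T> comparer = null)
    {
        if (threshold <= 0) return true;
        comparer = comparer ?? Comparer<T>.Default;

        using (var a = sortedA.GetEnumerator())
        using (var b = sortedB.GetEnumerator())
        {
            bool hasA = a.MoveNext();
            bool hasB = b.MoveNext();
            int count = 0;

            // Merge-style walk: always advance the cursor holding the smaller value.
            while (hasA && hasB)
            {
                int cmp = comparer.Compare(a.Current, b.Current);
                if (cmp < 0)
                {
                    hasA = a.MoveNext();
                }
                else if (cmp > 0)
                {
                    hasB = b.MoveNext();
                }
                else
                {
                    // Equal values: one match, then advance both cursors.
                    if (++count >= threshold) return true;
                    hasA = a.MoveNext();
                    hasB = b.MoveNext();
                }
            }
            return false;
        }
    }
}
```

Unlike Intersect, this counts matched pairs rather than distinct values, so a value repeated in both inputs can be counted more than once. The cost of sorting is not included here; the approach pays off mainly when the collections are kept sorted across many comparisons.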
Answer 4 (score: -3)
If speed matters, use the following.
HashSet<T> hash = new HashSet<T>(pointsA);
hash.IntersectWith(pointsB);
return hash.Count;
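If you can work with concrete collections, you should not use LINQ in performance-critical code.
Alternatively, try to obtain the elements as sets in the first place.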
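If the points can be kept as HashSets from the start, the per-call set construction disappears and an early-exit loop becomes straightforward. The following is a sketch under that assumption; SetOverlap and SharesAtLeast are illustrative names, and both sets are assumed to use the same equality comparer.

```csharp
using System;
using System.Collections.Generic;

public static class SetOverlap
{
    // Hypothetical helper: both arguments are HashSets that were built once and are reused.
    // Assumption: both sets use the same equality comparer.
    public static bool SharesAtLeast<T>(HashSet<T> a, HashSet<T> b, int threshold)
    {
        if (threshold <= 0) return true;

        // Iterate the smaller set and probe the larger one.
        HashSet<T> smaller = a.Count <= b.Count ? a : b;
        HashSet<T> larger = ReferenceEquals(smaller, a) ? b : a;

        // The intersection can never be larger than the smaller set.
        if (smaller.Count < threshold) return false;

        int shared = 0;
        foreach (T item in smaller)
        {
            // Constant-time lookup; stop as soon as the threshold is reached.
            if (larger.Contains(item) && ++shared >= threshold)
                return true;
        }
        return false;
    }
}
```

Unlike IntersectWith, this leaves both sets unmodified, so the same HashSet instances can be reused across all comparisons.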