I have a List<Keyword>, where the Keyword class is:
public string keyword;
public List<int> ids;
public int hidden;
public int live;
public bool worked;
Each Keyword has its own keyword string and a list of 20 IDs; live defaults to 1 and hidden to 0.
I just need to iterate over the whole main list and invalidate the keywords that share too many IDs, so I compare every pair, and if the second keyword has 6 or more IDs in common with the first, its hidden is set to 1 and its live to 0.
The algorithm is very basic, but it takes a long time when the main list contains many elements.
I am trying to figure out whether there is any way to speed it up.
The basic algorithm I use is:
foreach (Keyword main_keyword in lista_de_keywords_live)
{
    if (main_keyword.worked) {
        continue;
    }
    foreach (Keyword keyword_to_compare in lista_de_keywords_live)
    {
        // skip keywords already invalidated and the keyword itself
        if (keyword_to_compare.worked || keyword_to_compare == main_keyword) continue;
        int n_ids_same = 0;
        foreach (int id in main_keyword.ids)
        {
            if (keyword_to_compare.ids.IndexOf(id) >= 0)
            {
                if (++n_ids_same >= 6) break;
            }
        }
        // 6 or more shared ids: invalidate the compared keyword
        if (n_ids_same >= 6)
        {
            keyword_to_compare.hidden = 1;
            keyword_to_compare.live = 0;
            keyword_to_compare.worked = true;
        }
    }
}
Answer 0 (score: 2)
The code below is an example of how you would use a HashSet
for your problem. However, I would not recommend using it in this scenario. On the other hand, the idea of sorting the ids to make the comparison faster still applies.
Run it in a Console Project to try it out.
Notice that once I'm done adding new ids to a keyword, I sort them. This makes the comparison faster later on.
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text;

namespace KeywordExample
{
    public class Keyword
    {
        public List<int> ids;
        public int hidden;
        public int live;
        public bool worked;

        public Keyword()
        {
            ids = new List<int>();
            hidden = 0;
            live = 1;
            worked = false;
        }

        public override string ToString()
        {
            StringBuilder s = new StringBuilder();
            if (ids.Count > 0)
            {
                s.Append(ids[0]);
                for (int i = 1; i < ids.Count; i++)
                {
                    s.Append(',' + ids[i].ToString());
                }
            }
            return s.ToString();
        }
    }

    public class KeywordComparer : EqualityComparer<Keyword>
    {
        public override bool Equals(Keyword k1, Keyword k2)
        {
            int equals = 0;
            int i = 0;
            int j = 0;
            //based on sorted ids
            while (i < k1.ids.Count && j < k2.ids.Count)
            {
                if (k1.ids[i] < k2.ids[j])
                {
                    i++;
                }
                else if (k1.ids[i] > k2.ids[j])
                {
                    j++;
                }
                else
                {
                    equals++;
                    i++;
                    j++;
                }
            }
            return equals >= 6;
        }

        public override int GetHashCode(Keyword keyword)
        {
            return 0; //notice that using the same hash for all keywords gives you an O(n^2) time complexity though.
        }
    }

    class Program
    {
        static void Main(string[] args)
        {
            List<Keyword> listOfKeywordsLive = new List<Keyword>();
            //add some values
            Random random = new Random();
            int n = 10;
            int sizeOfMaxId = 20;
            for (int i = 0; i < n; i++)
            {
                var newKeyword = new Keyword();
                for (int j = 0; j < 20; j++)
                {
                    newKeyword.ids.Add(random.Next(sizeOfMaxId) + 1);
                }
                newKeyword.ids.Sort(); //sorting the ids
                listOfKeywordsLive.Add(newKeyword);
            }

            //solution here
            HashSet<Keyword> set = new HashSet<Keyword>(new KeywordComparer());
            set.Add(listOfKeywordsLive[0]);
            for (int i = 1; i < listOfKeywordsLive.Count; i++)
            {
                Keyword keywordToCompare = listOfKeywordsLive[i];
                if (!set.Add(keywordToCompare))
                {
                    keywordToCompare.hidden = 1;
                    keywordToCompare.live = 0;
                    keywordToCompare.worked = true;
                }
            }

            //print all keywords to check
            Console.WriteLine(set.Count + "/" + n + " inserted");
            foreach (var keyword in set)
            {
                Console.WriteLine(keyword);
            }
        }
    }
}
Answer 1 (score: 1)
The obvious source of the inefficiency is the way you compute the intersection of the two (ids) lists: the algorithm is O(n^2). This is exactly the problem relational databases solve for every join, and your approach would be called a loop join. The main efficient strategies are the hash join and the merge join. For your scenario I would guess the latter could work better, but you can also try HashSets if you like.
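For a single pair of keywords, the hash-join idea looks roughly like this. This is only a minimal sketch; the helper name SharesAtLeastSixIdsHashed and the early exit at 6 matches are my own assumptions, not code from this answer:

    // Hash-join sketch: hash one keyword's ids, then probe with the other's.
    static bool SharesAtLeastSixIdsHashed(List<int> a, List<int> b)
    {
        // Build side (assumes the ids within a keyword are distinct).
        var probe = new HashSet<int>(a);
        int shared = 0;
        // Probe side: look each id of the other keyword up in the set.
        foreach (int id in b)
        {
            if (probe.Contains(id) && ++shared >= 6)
            {
                return true; // early exit once enough ids match
            }
        }
        return false;
    }

Note that this builds a HashSet on every comparison, so compared with a merge over pre-sorted lists it trades the one-off sort for extra allocations per pair.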
The second source of inefficiency is that everything is done twice. Since (a join b) equals (b join a), you do not need two full passes over the whole List<Keyword>.
In fact, you only need to compare each keyword against the ones that have not already been marked as duplicates.
Using some code from here, you can write the algorithm as follows:
Parallel.ForEach(list, k => k.ids.Sort());

List<Keyword> result = new List<Keyword>();
foreach (var k in list)
{
    if (result.Any(r => r.ids.IntersectSorted(k.ids, Comparer<int>.Default)
                         .Skip(5)
                         .Any()))
    {
        k.hidden = 1;
        k.live = 0;
        k.worked = true;
    }
    else
    {
        result.Add(k);
    }
}
If you replace the LINQ with a method that works directly on indices (see the link above), I would guess it would be a bit faster.
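As a rough sketch of that index-only replacement (the helper name SharesAtLeast and its threshold parameter are assumptions of mine, not code from the linked answer), walking both sorted lists in lock step and stopping as soon as the threshold is reached:

    // Merge-style counting over two sorted id lists with early exit.
    // Assumes both lists are already sorted, as done by the Parallel.ForEach above.
    static bool SharesAtLeast(List<int> a, List<int> b, int threshold)
    {
        int i = 0, j = 0, shared = 0;
        while (i < a.Count && j < b.Count)
        {
            if (a[i] < b[j]) i++;
            else if (a[i] > b[j]) j++;
            else
            {
                if (++shared >= threshold) return true; // early exit once enough ids match
                i++;
                j++;
            }
        }
        return false;
    }

The check inside the loop would then read result.Any(r => SharesAtLeast(r.ids, k.ids, 6)), which avoids allocating the intermediate LINQ enumerables.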