Question

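I am working on a data mining project and chose the Apriori algorithm for the association rule task. In short, I am not satisfied with its execution time. I will describe the problematic part of my code.

I have two lists.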

List<List<int>> one;

List<List<int>> two;

I want to iterate over the elements of list one and check whether one[i] is a subset of two[j]:

foreach (List<int> items1 in one)
{
    foreach (List<int> items2 in two)
    {
        if (items2.ContainsSetOf(items1))
        {
            // do something
        }
    }
}

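I am wondering whether there is a way to reduce the execution time of this approach (parallel execution, using dictionaries, etc.).

Do you have any ideas for reducing it?

Thanks!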

Answer 1

Make them lists of sets, and use set operations to check whether one set is a subset of another.

Example

HashSet<int> set1 = new HashSet<int>();
set1.Add(1);
set1.Add(2);

HashSet<int> set2 = new HashSet<int>();
set2.Add(1);
set2.Add(2);
set2.Add(3);

List<HashSet<int>> one = new List<HashSet<int>>();
one.Add(set1);
one.Add(set2);

List<HashSet<int>> two = new List<HashSet<int>>();
two.Add(set1);
two.Add(set2);

foreach (HashSet<int> setA in one)
{
    foreach (HashSet<int> setB in two)
    {
        if (setA.IsSubsetOf(setB))
        {
            // do something
        }
    }
}

Answer 2

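If you want to reduce the number of "list in list" (or set in superset) checks, one approach is to build a hierarchy (tree) of the lists. Of course, the performance improvement (if any) depends on the data: if no list contains any other list, you will have to perform all the checks just as you do now.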
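One way to read this suggestion is sketched below. It is only an illustration under assumptions, not necessarily the structure the answer has in mind: instead of a strict tree, it precomputes for each set in two the other sets in two that contain it, so a single successful subset test also resolves all of those supersets. The names HierarchySketch, BuildSupersetLinks and FindSupersets are made up for this example, and the lists are assumed to have been converted to HashSet<int> first.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class HierarchySketch
{
    // For each set in `two`, collect the indices of the other sets in `two` that contain it.
    // This costs O(n2^2) subset tests up front, so it only pays off if `one` is large.
    public static List<int>[] BuildSupersetLinks(IReadOnlyList<HashSet<int>> two)
    {
        var links = new List<int>[two.Count];
        for (int i = 0; i < two.Count; i++)
        {
            links[i] = new List<int>();
            for (int j = 0; j < two.Count; j++)
            {
                if (i != j && two[i].IsSubsetOf(two[j]))
                {
                    links[i].Add(j);
                }
            }
        }
        return links;
    }

    // Returns the indices of all sets in `two` that contain `candidate`.
    public static SortedSet<int> FindSupersets(
        HashSet<int> candidate,
        IReadOnlyList<HashSet<int>> two,
        List<int>[] links)
    {
        var matched = new SortedSet<int>();
        // Visit smaller sets first: a hit on a small set implies hits on all its supersets.
        foreach (int i in Enumerable.Range(0, two.Count).OrderBy(k => two[k].Count))
        {
            if (matched.Contains(i))
            {
                continue; // already implied by an earlier hit, no subset test needed
            }
            if (candidate.IsSubsetOf(two[i]))
            {
                matched.Add(i);
                foreach (int j in links[i])
                {
                    matched.Add(j); // candidate ⊆ two[i] ⊆ two[j]
                }
            }
        }
        return matched;
    }
}
```

As the answer notes, if no set in two contains another, every links[i] is empty and this degenerates to the original full scan plus the precomputation cost.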

Answer 3

C# code snippet

var dict = new Dictionary<int, HashSet<List<int>>>();

foreach (List<int> list2 in two) {
   foreach (int i in list2) {
      if (!dict.ContainsKey(i)) {
         //create empty HashSet dict[i]
         dict.Add(i, new HashSet<List<int>>());
      }
      //add reference to list2 to the HashSet dict[i]
      dict[i].Add(list2); 
   }
}

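foreach (List<int> list1 in one) {
   HashSet<List<int>> listsInTwoContainingList1 = null;
   foreach (int i in list1) {
      //an item that appears in no list of two means nothing can contain list1
      if (!dict.TryGetValue(i, out HashSet<List<int>> listsContainingI)) {
         listsInTwoContainingList1 = new HashSet<List<int>>();
         break;
      }
      if (listsInTwoContainingList1 == null) {
         listsInTwoContainingList1 = new HashSet<List<int>>(listsContainingI);
      } else {
         listsInTwoContainingList1.IntersectWith(listsContainingI);
      }
      if (listsInTwoContainingList1.Count == 0) {   //optimization: no candidates left
         break;
      }
   }
   if (listsInTwoContainingList1 == null) {
      continue;   //list1 is empty (trivially a subset of every list); handle separately if needed
   }
   foreach (List<int> list2 in listsInTwoContainingList1) {
      //list2 contains list1
      //do something
   }
}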

Example

L2 = {
  L2a = {10, 20, 30, 40}
  L2b = {30, 40, 50, 60}
  L2c = {10, 25, 30, 40}
}
L1 = {
  L1a = {10, 30, 40}
  L1b = {30, 25, 50}
}

After the first part of the code:

dict[10] = {L2a, L2c}
dict[20] = {L2a}
dict[25] = {L2c}
dict[30] = {L2a, L2b, L2c}
dict[40] = {L2a, L2b, L2c}
dict[50] = {L2b}
dict[60] = {L2b}

In the second part of the code:

L1a: dict[10] ∩ dict[30] ∩ dict[40] = {L2a, L2c}
L1b: dict[30] ∩ dict[25] ∩ dict[50] = { }

So L1a is contained in L2a and L2c, but L1b is not contained in any list of L2.
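To check the worked example by hand, the two steps can be condensed into a small self-contained sketch. This is an illustration only: the class and method names (InvertedIndexExample, BuildIndex, Containing) are made up here, and the lists are keyed by their labels (L2a, L2b, L2c) rather than by reference so that the results are easy to read.

```csharp
using System;
using System.Collections.Generic;

public static class InvertedIndexExample
{
    // Step 1: map each item to the labels of the lists in L2 that contain it.
    public static Dictionary<int, HashSet<string>> BuildIndex(Dictionary<string, int[]> l2)
    {
        var dict = new Dictionary<int, HashSet<string>>();
        foreach (var entry in l2)
        {
            foreach (int item in entry.Value)
            {
                if (!dict.TryGetValue(item, out var labels))
                {
                    labels = new HashSet<string>();
                    dict[item] = labels;
                }
                labels.Add(entry.Key);
            }
        }
        return dict;
    }

    // Step 2: intersect dict[item] over all items of `list`.
    // The result holds the labels of the lists in L2 that contain `list`.
    public static SortedSet<string> Containing(int[] list, Dictionary<int, HashSet<string>> dict)
    {
        SortedSet<string> result = null;
        foreach (int item in list)
        {
            if (!dict.TryGetValue(item, out var labels))
            {
                return new SortedSet<string>(); // item occurs nowhere in L2
            }
            if (result == null)
            {
                result = new SortedSet<string>(labels);
            }
            else
            {
                result.IntersectWith(labels);
            }
            if (result.Count == 0)
            {
                break; // no candidates left
            }
        }
        // An empty input list is reported as "no matches" in this sketch.
        return result ?? new SortedSet<string>();
    }
}
```

With the example data above, Containing(new[] { 10, 30, 40 }, dict) yields L2a and L2c, and Containing(new[] { 30, 25, 50 }, dict) yields an empty set, matching the intersections worked out by hand.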

Complexity

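Now, regarding the complexity of the algorithm, assume that L1 has n1 elements, L2 has n2 elements, the average number of elements in the sublists of L1 is m1, and the average number of elements in the sublists of L2 is m2. Then:

The original solution is: O(n1 x n2 x m1 x m2) if the containsSetOf method does a nested loop, or at best O(n1 x n2 x (m1 + m2)) if it uses a HashSet. Is7aq's solution is also O(n1 x n2 x (m1 + m2)).

The proposed solution is: O(n2 x m2 + n1 x (m1 x nd + n2)), where nd is the average number of elements of the sets dict[i].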

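The efficiency of the proposed solution depends heavily on nd:

If nd is large, close to n2 (when every integer is part of every sublist of L2), then it is as slow as the original solution.

However, if nd is expected to be small (i.e. the sublists of L2 are quite different from each other), then the proposed solution will typically be much faster, especially if n1 and n2 are large.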

Improving the execution time of an algorithm
