How to generate k-itemsets for the Apriori algorithm in Python

Asked: 2014-10-23 13:56:52

Tags: python tuples apriori

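This is my first attempt at coding in Python, and I am implementing the Apriori algorithm. I have generated itemsets up to 2-itemsets; below is the function I use to generate the 2-itemsets by combining the keys of the 1-itemsets.

How can I make this function generic? That is, given the dictionary's keys and the number of elements wanted in each tuple, the algorithm should use the keys to generate all possible n-element (k + 1) subsets. I know that a union on sets is one possibility, but is there a way to union tuples, which are essentially the keys of my dictionary?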

# Generate 2-itemset candidates by joining the 1-itemset candidates.
def candidate_gen(keys):
    adict = {}
    for i in keys:
        for j in keys:
            # j > i keeps each unordered pair once, already in sorted order
            if j > i:
                # TODO: call join procedure which will generate f(k+1) keys
                # TODO: call has_infrequent_subset --> generates all possible k+1
                #       itemsets and checks if k itemsets are present in f(k) keys
                adict[(i, j)] = 0
    return adict

For example, suppose my initial dictionary looks like this ({key: value}, where the value is the frequency):

{'382': 1163, '298': 560, '248': 1087, '458': 720, 
 '118': 509,  '723': 528, '390': 1288}

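I take the keys of this dictionary and pass them to the candidate_gen function above. It generates the 2-itemset subsets and returns them as keys. I then pass those keys to a function that finds their frequencies by comparing against the original database, which gives this output: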

{('390', '723'): 65, ('118', '298'): 20, ('298', '390'): 70, ('298', '458'): 35, 
 ('248', '382'): 88, ('248', '458'): 76, ('248', '723'): 26, ('382', '723'): 203,
 ('390', '458'): 33, ('118', '458'): 26, ('458', '723'): 26, ('248', '390'): 87,
 ('118', '248'): 54, ('298', '382'): 47, ('118', '723'): 41, ('382', '390'): 413,
 ('382', '458'): 57, ('248', '298'): 64, ('118', '382'): 40, ('298', '723'): 36, 
 ('118', '390'): 52}

How can I generate the 3-itemset subsets from the keys above?

2 answers:

Answer 0 (score: 1)

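I think that, given your field, you would benefit a lot from studying Python's itertools library.

For your use case you can use itertools combinations directly, or wrap it in a helper function: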

from itertools import combinations
def ord_comb(l,n):
    return list(combinations(l,n))

#### TESTING ####
a = [1,2,3,4,5]
print(ord_comb(a,1))
print(ord_comb(a,5))
print(ord_comb(a,6))
print(ord_comb([],2))
print(ord_comb(a,3))

Output

[(1,), (2,), (3,), (4,), (5,)]
[(1, 2, 3, 4, 5)]
[]
[]
[(1, 2, 3), (1, 2, 4), (1, 2, 5), (1, 3, 4), (1, 3, 5), (1, 4, 5), (2, 3, 4), (2, 3, 5), (2, 4, 5), (3, 4, 5)]

Note that the order of the elements within each n-tuple follows the order of the iterable you pass to combinations.
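To tie this back to the question, here is a minimal sketch of how combinations could drive a generic join step that builds (k + 1)-itemset candidates from k-itemset keys. The names `as_tuple` and `next_candidates` are hypothetical, not part of the question's code, and the sketch assumes that all keys passed in have the same length and that the items inside each tuple are already sorted.

```python
from itertools import combinations

def as_tuple(key):
    # The question's 1-itemset keys are bare strings; wrap them as 1-tuples.
    return key if isinstance(key, tuple) else (key,)

def next_candidates(keys):
    """Join k-itemset keys into candidate (k+1)-itemsets (illustrative sketch)."""
    prev = {as_tuple(key) for key in keys}
    candidates = {}
    for a, b in combinations(sorted(prev), 2):
        # Join step: merge two k-itemsets that share their first k-1 items.
        if a[:-1] == b[:-1]:
            cand = a + (b[-1],)
            # Prune step: keep cand only if every k-item subset is in prev.
            if all(sub in prev for sub in combinations(cand, len(a))):
                candidates[cand] = 0
    return candidates
```

Calling `next_candidates` on the 1-itemset keys should give the 2-itemset keys, and calling it again on the frequent 2-itemset keys should give the 3-itemset candidates, so the same function can be reused at every level.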

Answer 1 (score: 0)

Like this?

In [12]: [(x, y) for x in keys for y in keys if y>x]
Out[12]: 
[('382', '723'),
 ('382', '458'),
 ('382', '390'),
 ('458', '723'),
 ('298', '382'),
 ('298', '723'),
 ('298', '458'),
 ('298', '390'),
 ('390', '723'),
 ('390', '458'),
 ('248', '382'),
 ('248', '723'),
 ('248', '458'),
 ('248', '298'),
 ('248', '390'),
 ('118', '382'),
 ('118', '723'),
 ('118', '458'),
 ('118', '298'),
 ('118', '390'),
 ('118', '248')]
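The comprehension above is fixed at pairs. As a rough sketch of how the same idea might extend to any k, one could sort the keys and let itertools.combinations enumerate the tuples. The helper name `all_itemsets` is hypothetical, and this version lists every k-item tuple without any Apriori pruning, so it can grow quickly for large key sets.

```python
from itertools import combinations

def all_itemsets(keys, k):
    # Sorting first makes each tuple come out in ascending order,
    # matching the (smaller, larger) layout of the pairs above.
    return list(combinations(sorted(keys), k))

keys = ['382', '298', '248', '458', '118', '723', '390']
pairs = all_itemsets(keys, 2)    # same 21 pairs as the comprehension, sorted
triples = all_itemsets(keys, 3)  # every 3-item tuple over the seven keys
```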