Vectorize this loop to compare all possible combinations faster

Time: 2017-10-24 19:27:41

Tags: python python-3.x numpy

I have a data_file with 88k rows.

First five rows of data_file:

data_file[:5]
Out[8]: 
   col_1  col_2  col_3  col_4  col_5  col_6  col_7  col_8  col_9  col_10  \
0      1      2      3      4      5      6      7      8      9      10   
1     31     32     33      0      0      0      0      0      0       0   
2     34     35     36      0      0      0      0      0      0       0   
3     37     38     39     40     41     42     43     44     45      46   
4     39     40     48     49      0      0      0      0      0       0   

   col_67  col_68  col_69  col_70  col_71  col_72  col_73  col_74  \
0   ...         0       0       0       0       0       0       0       0   
1   ...         0       0       0       0       0       0       0       0   
2   ...         0       0       0       0       0       0       0       0   
3   ...         0       0       0       0       0       0       0       0   
4   ...         0       0       0       0       0       0       0       0   

   col_75  col_76  
0       0       0  
1       0       0  
2       0       0  
3       0       0  
4       0       0 

data_file has 88k rows and 76 columns; each value is in the range (0-1000), and most are 0's. The matrix represents 88k transactions, where each transaction/row lists the items that were purchased.

Ex: the `2nd-transaction` has items 31, 32, 33 among 1000 total possible items,

the `3rd-transaction` has items 34, 35, 36 ..and so on..

Now to compute freq_oneItemSet, we count the occurrences of each of the (1000) items across all transactions/rows, then keep those whose count is greater than supp_count. So simply:

#%% Compute support count
import math

import numpy as np

supp_thresh = 0.02
T_IDS = len(data_file)
supp_count = math.floor(supp_thresh * T_IDS)
print("\n Supp_Thresh: ", supp_thresh,
      ";  T_ID's: ", T_IDS, ";  Supp_count: ", supp_count)
#%% Get frequent one_itemset
print('---Frequent one-ItemSet Generation---')
allUniq_items, allitem_counts = np.unique(data_file, return_counts=True)
allUniq_items = np.delete(allUniq_items, 0)    # drop the padding 0's
allitem_counts = np.delete(allitem_counts, 0)  # drop the padding 0's
freq_oneItemSet = allUniq_items[allitem_counts >= supp_count]
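As a sanity check, the same counting logic can be run on a tiny made-up matrix (the values below are illustrative, not the real data):

```python
import math

import numpy as np

# 4 toy "transactions", items padded with 0's like the real data_file
toy = np.array([
    [31, 32, 33, 0],
    [31, 33,  0, 0],
    [34,  0,  0, 0],
    [31,  0,  0, 0],
])

supp_thresh = 0.5
supp_count = math.floor(supp_thresh * len(toy))  # 2

items, counts = np.unique(toy, return_counts=True)
items, counts = items[1:], counts[1:]  # drop the padding 0's
freq_one = items[counts >= supp_count]
print(freq_one)  # [31 33]
```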

Output: Frequent_oneItemSet, support count, and other details:

Support-Count:  1763
Number of frequent one-items:  20
Frequent one itemset, Frequent one itemcounts:
[[   33 15167]
 [   37  2936]
 [   39 15596]
 [   40 50675]
 [   42 14945]
 [   49 42135]
 [   66  4472]
 [   90  3837]
 [  102  2237]
 [  111  2794]
 [  148  1779]
 [  171  3099]
 [  226  3257]
 [  238  3032]
 [  272  2094]
 [  311  2594]
 [  414  1880]
 [  439  1863]
 [  476  2167]
 [ 1328  1786]]
------------------------------------

Now to generate the two-candidates, I take the 2-combinations of freq_oneItemSet, check each row of the main data_file, count the occurrences, and save a combination if its count is greater than supp_count.

Example:

For the 2-combination [33,39] of freq_oneItemSet, we count the number of rows (out of 88K) that contain both items, do this for all possible combinations, and keep the combinations whose count is greater than supp_count.

My code below outputs:

 `[33,39] occurred in 2833 rows`,
 `[33,40] occurred in 8455 rows`

I have written code that generates these 190 combinations, checks whether each one's count is greater than the support count, and if so appends it to the frequent two-itemset.

It takes 32 seconds to execute. Can this time be improved?

---Frequent two-candidate Generation---
------------------------------------
---Frequent two-ItemSet---
Support-Count:  1763
Number of frequent two-items:  22
freq_two-ItemSet, freq_two-ItemCount
[[   33    39  2833]
 [   33    40  8455]
 [   33    42  3196]
 [   33    49  8034]
 [   37    39  2790]
 [   37    40  2037]
 [   39    40 10345]
 [   39    42  3897]
 [   39    49  7944]
 [   39   111  2725]
 [   39   171  3031]
 [   40    42 11414]
 [   40    49 29142]
 [   40    66  2787]
 [   40    90  2749]
 [   40   171  2059]
 [   40   226  2351]
 [   40   238  1929]
 [   40   311  1852]
 [   42    49  9018]
 [   49    66  2529]
 [   49    90  2798]]
------------------------------------
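The 32-second code itself is not shown in the question; a plausible brute-force version (a hypothetical reconstruction, not the asker's actual code) scans the full matrix once per pair, which is what makes it slow:

```python
import itertools

import numpy as np

def freq_two_bruteforce(data, freq_one, supp_count):
    """Slow reference version: one full pass over the matrix per pair."""
    result = []
    for a, b in itertools.combinations(freq_one, 2):
        # number of rows containing both item a and item b
        n = int(np.sum((data == a).any(axis=1) & (data == b).any(axis=1)))
        if n >= supp_count:
            result.append((a, b, n))
    return result

toy = np.array([[31, 33, 0], [31, 33, 39], [31, 0, 0]])
print(freq_two_bruteforce(toy, [31, 33, 39], supp_count=2))  # [(31, 33, 2)]
```

With 20 frequent items this does 190 full passes over the 88k-row matrix, which matches the observed slowness.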

1 answer:

Answer 0: (score: 2)

A set-based approach seems more efficient here, since your matrix looks very sparse.

First compute the sets:

frequents = set(freq_oneItemSet)
# keep only the frequent items present in each row
sets = [set(row) & frequents for _, row in data_file.iterrows()]

Now the pairs:

import collections
import itertools

c = collections.Counter()
for s in sets:
    # sort so each pair is counted in one canonical (a, b) order
    for pair in itertools.combinations(sorted(s), 2):
        c[pair] += 1

I expect this to generate very few iterations, since the sets will be thin.
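A step the answer leaves implicit is reading the frequent pairs off the counter with the same supp_count threshold. A self-contained sketch with toy stand-in sets (the real sets come from the snippet above; the values here are made up):

```python
import collections
import itertools

supp_count = 2  # would be 1763 for the real data

# toy stand-ins for the per-row sets of frequent items
sets = [{31, 33}, {31, 33, 39}, {31}]

c = collections.Counter()
for s in sets:
    for pair in itertools.combinations(sorted(s), 2):  # sorted -> canonical (a, b) order
        c[pair] += 1

freq_two = [(pair, n) for pair, n in c.items() if n >= supp_count]
print(freq_two)  # [((31, 33), 2)]
```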

On this (not sparse) example:

import pandas as pd
from numpy.random import randint

data_file = pd.DataFrame(randint(0, 1000, (88000, 76)))
frequents = set(range(20))

it takes 5 seconds on my computer.

But pure vectorization is still better on this example:

def g(data_file, frequents):
    # frequents must be an indexable sequence here (it is indexed below), not a set
    # is_in[i, r] is True when frequents[i] appears anywhere in row r
    is_in = np.equal.outer(frequents, data_file).any(axis=2)
    first, second, _ = np.where(np.logical_and(is_in[:, None], is_in[None]))
    # pack the index pair (i, j) into i + 1j*j so np.unique can count pairs in one call
    cp, counts = np.unique(first + 1j * second, return_counts=True)
    xp, yp = cp.real.astype(int), cp.imag.astype(int)
    list_ = [((frequents[x], frequents[y]), count)
             for (x, y, count) in zip(xp, yp, counts) if x < y]
    return list_

I use complex numbers to simplify the counting. In your program, you compute is_in ... about 40 times:

In [511]: %time s=g(data_file,sorted(np.random.choice(range(1000),20)))
Wall time: 483 ms

In [512]: len(s)
Out[512]: 190
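The complex-number encoding in g can be verified on a small made-up matrix (a self-contained copy of the function above; counts are cast to plain int for readable output):

```python
import numpy as np

def g(data_file, frequents):
    # is_in[i, r]: does frequents[i] appear anywhere in row r?
    is_in = np.equal.outer(frequents, data_file).any(axis=2)
    # indices (i, j) of frequent-item pairs co-occurring in some row
    first, second, _ = np.where(np.logical_and(is_in[:, None], is_in[None]))
    # pack (i, j) into i + 1j*j so np.unique counts each pair in one call
    cp, counts = np.unique(first + 1j * second, return_counts=True)
    xp, yp = cp.real.astype(int), cp.imag.astype(int)
    return [((frequents[x], frequents[y]), int(count))
            for (x, y, count) in zip(xp, yp, counts) if x < y]

toy = np.array([[31, 33, 0], [31, 33, 39], [31, 0, 0]])
print(g(toy, [31, 33, 39]))
# [((31, 33), 2), ((31, 39), 1), ((33, 39), 1)]
```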