I have a `data_file` with 88k rows.
The first five rows of `data_file`:
data_file[:5]
Out[8]:
col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9 col_10 \
0 1 2 3 4 5 6 7 8 9 10
1 31 32 33 0 0 0 0 0 0 0
2 34 35 36 0 0 0 0 0 0 0
3 37 38 39 40 41 42 43 44 45 46
4 39 40 48 49 0 0 0 0 0 0
col_67 col_68 col_69 col_70 col_71 col_72 col_73 col_74 \
0 ... 0 0 0 0 0 0 0 0
1 ... 0 0 0 0 0 0 0 0
2 ... 0 0 0 0 0 0 0 0
3 ... 0 0 0 0 0 0 0 0
4 ... 0 0 0 0 0 0 0 0
col_75 col_76
0 0 0
1 0 0
2 0 0
3 0 0
4 0 0
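`data_file` has 88k rows and 76 columns. Each value lies in the range (0-1000), and most are 0's. The matrix represents 88k transactions; each transaction/row lists the items that were purchased.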
Ex: the `2nd-transaction` has items 31,32,33 out of 1000 total possible items,
the `3rd-transaction` has items 34,35,36, and so on.
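Now, to compute `freq_oneItemSet`, we count the occurrences of each of the (1000) items across all transactions/rows, then keep those whose count is greater than `supp_count`. So it is simple: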
#%% Compute support count
import math
import numpy as np

supp_thresh = 0.02
T_IDS = len(data_file)  # number of transactions
supp_count = math.floor(supp_thresh * T_IDS)
print("\n Supp_Thresh: ", supp_thresh,
      "; T_ID's: ", T_IDS, "; Supp_count: ", supp_count)
#%% Get frequent one_itemset
print('---Frequent one-ItemSet Generation---')
allUniq_items, allitem_counts = np.unique(data_file, return_counts=True)
allUniq_items = np.delete(allUniq_items, 0)    # remove added 0's
allitem_counts = np.delete(allitem_counts, 0)  # remove added 0's
freq_oneItemSet = allUniq_items[allitem_counts >= supp_count]
Output: the frequent one-itemset, support count, and other details:
Support-Count: 1763
Number of frequent one-items: 20
Frequent one itemset, Frequent one itemcounts:
[[ 33 15167]
[ 37 2936]
[ 39 15596]
[ 40 50675]
[ 42 14945]
[ 49 42135]
[ 66 4472]
[ 90 3837]
[ 102 2237]
[ 111 2794]
[ 148 1779]
[ 171 3099]
[ 226 3257]
[ 238 3032]
[ 272 2094]
[ 311 2594]
[ 414 1880]
[ 439 1863]
[ 476 2167]
[ 1328 1786]]
------------------------------------
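To make the counting step concrete, here is a minimal sketch on a made-up 4-row matrix. The toy data and the 0.5 threshold are assumptions chosen for illustration; only the `np.unique` approach comes from the code above.

```python
import math
import numpy as np

# Toy transaction matrix (made-up data): one row per transaction, 0 is padding.
toy = np.array([
    [31, 32, 33, 0],
    [33, 39, 40, 0],
    [39, 40, 48, 49],
    [33, 40, 0, 0],
])

supp_thresh = 0.5                                # assumed threshold for the toy data
supp_count = math.floor(supp_thresh * len(toy))  # floor(0.5 * 4) = 2

# Count how often every value occurs anywhere in the matrix.
items, counts = np.unique(toy, return_counts=True)
keep = items != 0                                # drop the padding value
items, counts = items[keep], counts[keep]

# Keep the items whose count reaches the support count.
mask = counts >= supp_count
freq_items, freq_counts = items[mask], counts[mask]
print(np.column_stack([freq_items, freq_counts]))
```

Note that `np.unique` counts cells, so the result equals a per-row support count only when no item repeats within a row.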
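Now, to generate the two-item candidates, I take every 2-combination of `freq_oneItemSet`, check each row of the main `data_file`, count the occurrences, and keep the combination if its count is greater than `supp_count`.
Example:
For the `2-combination` [33,39] of `freq_oneItemSet`, we count the number of `rows (88K)` that contain both items. We do this for all possible combinations and keep those whose count is greater than `supp_count`.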
My code outputs:
`[33,39] occurred in 2833 rows`,
`[33,40] occurred in 8455 rows`
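I have written code that generates these 190 combinations and checks whether each count is greater than the support count; if so, the combination is appended to the frequent two-itemset.
It takes 32 seconds to execute. Can this time be improved?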
---Frequent two-candidate Generation---
------------------------------------
---Frequent two-ItemSet---
Support-Count: 1763
Number of frequent two-items: 22
freq_two-ItemSet, freq_two-ItemCount
[[ 33 39 2833]
[ 33 40 8455]
[ 33 42 3196]
[ 33 49 8034]
[ 37 39 2790]
[ 37 40 2037]
[ 39 40 10345]
[ 39 42 3897]
[ 39 49 7944]
[ 39 111 2725]
[ 39 171 3031]
[ 40 42 11414]
[ 40 49 29142]
[ 40 66 2787]
[ 40 90 2749]
[ 40 171 2059]
[ 40 226 2351]
[ 40 238 1929]
[ 40 311 1852]
[ 42 49 9018]
[ 49 66 2529]
[ 49 90 2798]]
------------------------------------
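The code that produced this output is not shown, so the following is only a sketch of the procedure described in the question, not the asker's implementation. The function name `count_pairs_bruteforce` and the toy data are hypothetical. It computes one row-membership vector per frequent item up front and reuses it for every pair, and it uses `>=` to mirror the comparison in the one-itemset code.

```python
from itertools import combinations
import numpy as np

def count_pairs_bruteforce(data, freq_items, supp_count):
    """Hypothetical helper: count the rows that contain both items of each pair."""
    data = np.asarray(data)
    freq_items = sorted(int(i) for i in freq_items)
    # One boolean vector per item: contains[item][r] is True if row r has the item.
    contains = {item: (data == item).any(axis=1) for item in freq_items}
    result = []
    for a, b in combinations(freq_items, 2):
        n_rows = int(np.count_nonzero(contains[a] & contains[b]))
        if n_rows >= supp_count:  # same comparison as the one-itemset code
            result.append((a, b, n_rows))
    return result

# Made-up data: 0 is padding, as in data_file.
toy = np.array([
    [31, 32, 33, 0],
    [33, 39, 40, 0],
    [39, 40, 48, 49],
    [33, 40, 0, 0],
])
pairs = count_pairs_bruteforce(toy, [33, 39, 40], supp_count=2)
print(pairs)
```

On the toy data, [33,39] occurs together in only one row and is dropped, while [33,40] and [39,40] each occur in two rows and are kept.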
Answer 0 (score: 2)
A set-based approach seems more efficient here, because your matrix looks very sparse.
First compute the sets:
frequents=set(freq_oneItemSet)
sets = [set(row)&frequents for _,row in data_file.iterrows()]
Now the pairs:
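import collections, itertools

c = collections.Counter()
for s in sets:
    # sort so each pair is always counted under the same (smaller, larger) key
    for pair in itertools.combinations(sorted(s), 2):
        c[pair] += 1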
I expect this to need few iterations, because the sets will be small.
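As a small end-to-end illustration of the set-based idea, the sketch below uses plain Python lists in place of the DataFrame rows. The toy rows, the `frequents` set, and the threshold of 2 are assumptions made for the example.

```python
import collections
import itertools

# Made-up transactions; 0 is padding, as in data_file.
rows = [
    [31, 32, 33, 0],
    [33, 39, 40, 0],
    [39, 40, 48, 49],
    [33, 40, 0, 0],
]
frequents = {33, 39, 40}   # assumed frequent one-items
supp_count = 2             # assumed support count for the toy data

# Keep only the frequent items of each transaction.
sets = [set(row) & frequents for row in rows]

pair_counts = collections.Counter()
for s in sets:
    # Sorting gives every pair a canonical (smaller, larger) key.
    for pair in itertools.combinations(sorted(s), 2):
        pair_counts[pair] += 1

# Keep the pairs that reach the support count.
frequent_pairs = {p: n for p, n in pair_counts.items() if n >= supp_count}
print(frequent_pairs)
```

Each transaction contributes only the pairs it actually contains, so the work depends on how many frequent items appear per row rather than on the 190 candidate pairs.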
On this (non-sparse) example:
data_file=pd.DataFrame(np.random.randint(0,1000,(88000,76)))
frequents=set(range(20))
it takes 5 seconds on my computer.
But a purely vectorized version is still better on this example:
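def g(data_file, frequents):
    # frequents must be an indexable sequence, e.g. a sorted list
    data = np.asarray(data_file)  # convert once to a plain ndarray
    # is_in[i, r] is True when row r contains frequents[i]
    is_in = np.equal.outer(frequents, data).any(axis=2)
    # indices (i, j, r) such that row r contains both frequents[i] and frequents[j]
    first, second, _ = np.where(np.logical_and(is_in[:, None], is_in[None]))
    # encode each (i, j) as the complex number i + 1j*j and count duplicates
    cp, counts = np.unique(first + 1j * second, return_counts=True)
    xp, yp = cp.real.astype(int), cp.imag.astype(int)
    list_ = [((frequents[x], frequents[y]), count)
             for (x, y, count) in zip(xp, yp, counts) if x < y]
    return list_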
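I use complex numbers to simplify the counting. In your program, you compute `is_in` ... about 40 times. Timing for the function above: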
In [511]: %time s=g(data_file,sorted(np.random.choice(range(1000),20)))
Wall time: 483 ms
In [512]: len(s)
Out[512]: 190
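The complex-number trick can be seen in isolation on a few made-up index pairs. This is a sketch with assumed toy arrays. It also shows `np.unique` with `axis=0`, which counts the same pairs without the complex encoding.

```python
import numpy as np

# Made-up index pairs (first[k], second[k]); some pairs repeat.
first = np.array([0, 0, 1, 0, 1, 2])
second = np.array([1, 1, 2, 1, 2, 0])

# Encode each pair as one complex number so a 1-D np.unique can count it.
codes, counts = np.unique(first + 1j * second, return_counts=True)
pairs = list(zip(codes.real.astype(int).tolist(),
                 codes.imag.astype(int).tolist()))
print(pairs, counts.tolist())

# Alternative without complex numbers: count unique rows of a 2-column array.
rows, row_counts = np.unique(np.stack([first, second], axis=1),
                             axis=0, return_counts=True)
print(rows.tolist(), row_counts.tolist())
```

Both variants return the pairs in the same lexicographic order, so either can feed the final list comprehension in `g`.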