import pandas as pd

df = pd.Series([["26"], ["81", "15", "27"], ["50"], ["8"], ["81", "15"],
                ["10"], ["81"]]).to_frame(name='itemsets')
itemsets
0 [26]
1 [81, 15, 27]
2 [50]
3 [8]
4 [81, 15]
5 [10]
6 [81]
rule = [["81"],["15"]]
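I tried the code below, but it is slow: the dataset has 9 million rows and the call takes more than 4 seconds. I am looking for a more efficient way to do this. Would converting the DataFrame to a NumPy array help, or is there another way to speed it up?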
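SIZE = len(df)  # assumed: SIZE is the total number of transactions

def support(rule):
    # Merge antecedent and consequent into a single itemset
    items = set(rule[0] + rule[1])
    # Iterate over the column values (iterating over df yields column names)
    count = sum(items <= set(row) for row in df.itemsets)
    return count / SIZE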
When I run it on the largest dataset, which contains 9 million transactions, the result is:
support(rule)
0.001039247773829178
The idea is to count the rows in the DataFrame for which the rule's items form a non-strict subset.
Answer 0 (score: 2)
IIUC:
Define rule as a flat list:
rule = ['81', '15']
df.itemsets.apply(set).le(set(rule)).mean()
0.2857142857142857
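One thing worth checking is the direction of the comparison. `apply(set).le(set(rule))` tests whether each row is a subset of `rule`, whereas the `support` function in the question tests whether `rule` is a subset of each row. On the sample data both give 2/7 for `['81', '15']`, but they differ in general. Below is a minimal sketch of the question's direction, assuming the same `itemsets` column; the helper name `support_contains` is hypothetical.

```python
import pandas as pd

df = pd.Series([["26"], ["81", "15", "27"], ["50"], ["8"], ["81", "15"],
                ["10"], ["81"]]).to_frame(name="itemsets")

def support_contains(rule, series):
    """Fraction of rows that contain every item of `rule` (hypothetical helper)."""
    rule_set = set(rule)
    # set.issubset accepts any iterable, so the rows can stay as lists
    return series.map(rule_set.issubset).mean()

# Rule contained in the row: rows 1 and 4 match
both = support_contains(["81", "15"], df.itemsets)
# Rule contained in the row: rows 1, 4 and 6 match
single = support_contains(["81"], df.itemsets)
# Reverse direction (row contained in the rule): only row 6 matches
subset_of_rule = df.itemsets.map(lambda row: set(row) <= {"81"}).mean()
```

With the single-item rule `['81']` the two directions give 3/7 and 1/7 respectively, which makes the difference visible on the sample data.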
This should speed things up:
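import numpy as np

def support(rule, series):
    n = len(series)
    # Row index of every item once the lists are flattened
    i = np.arange(n).repeat(series.str.len())
    out = np.ones(n, bool)
    # A row stays True only if every one of its items is in rule
    np.logical_and.at(out, i, np.isin(np.concatenate(series.tolist()), rule))
    return out.mean()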
support(rule, df.itemsets)
0.2857142857142857
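If the intended test is the one in the question's code (every item of `rule` appears in the row, rather than every item of the row appearing in `rule`), the same flattening idea can be adapted. The following is only a sketch under that assumption; `support_contains_np` is a hypothetical name, every row is assumed to be a non-empty list of strings, and whether it beats a plain Python loop on 9 million rows would need to be timed.

```python
import numpy as np
import pandas as pd

def support_contains_np(rule, series):
    """Fraction of rows containing every item of `rule` (hypothetical helper).

    Assumes every row is a non-empty list of strings.
    """
    items = list(set(rule))
    n = len(series)
    # Row index of every item once the lists are flattened
    row_ids = np.arange(n).repeat(series.str.len().to_numpy())
    flat = np.concatenate(series.tolist())
    out = np.ones(n, dtype=bool)
    for item in items:
        # Mark the rows in which this particular item occurs
        present = np.zeros(n, dtype=bool)
        present[row_ids[flat == item]] = True
        # A row survives only if it has contained every item so far
        out &= present
    return out.mean()

df = pd.Series([["26"], ["81", "15", "27"], ["50"], ["8"], ["81", "15"],
                ["10"], ["81"]]).to_frame(name="itemsets")

result_pair = support_contains_np(["81", "15"], df.itemsets)
result_single = support_contains_np(["81"], df.itemsets)
```

The loop runs once per rule item, not once per row, so the per-row work stays inside NumPy.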