Computing the support of a rule in a large dataset

Time: 2018-08-06 19:15:16

Tags: python pandas function numpy optimization

df = pd.Series([["26"], ["81", "15", "27"], ["50"], ["8"], ["81", "15"], 
["10"], ["81"]]).to_frame(name='itemsets')


       itemsets
0          [26]
1  [81, 15, 27]
2          [50]
3           [8]
4      [81, 15]
5          [10]
6          [81]

rule = [["81"],["15"]]

I tried the following, but it is very slow because the dataset has 9 million rows, and this code takes more than 4 seconds. I am looking for an efficient way to run it, such as converting the dataframe to a numpy array, or anything else that speeds it up:

def support(rule):
    # flatten antecedent + consequent into a single list of items
    rule = rule[0] + rule[1]
    # count rows whose itemset contains every item of the rule
    support = sum(set(rule) <= set(row) for row in df.itemsets)
    return support / SIZE

When I try it on the largest dataset, which contains 9 million transactions, the result is:

support(rule)
0.001039247773829178

The idea is to count the occurrences of non-strict subsets in the dataframe.
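That subset test can be sketched with plain Python sets on the toy data above (a minimal illustration of the counting idea, not the optimized solution):

```python
# Toy data and rule from the question
itemsets = [["26"], ["81", "15", "27"], ["50"], ["8"], ["81", "15"],
            ["10"], ["81"]]
rule = {"81", "15"}

# Count rows whose itemset contains every item of the rule (rows 1 and 4)
matches = sum(rule <= set(row) for row in itemsets)
print(matches / len(itemsets))  # 0.2857142857142857 (2/7)
```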

1 answer:

Answer 0: (score: 2)

IIUC:

With rule defined as

rule = ['81', '15']

df.itemsets.apply(set).le(set(rule)).mean()

0.2857142857142857
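Step by step, the one-liner converts each row's list to a set and then does an elementwise `<=` comparison against the rule set (a sketch on the toy frame; `as_sets` and `mask` are illustrative names):

```python
import pandas as pd

df = pd.Series([["26"], ["81", "15", "27"], ["50"], ["8"], ["81", "15"],
                ["10"], ["81"]]).to_frame(name='itemsets')
rule = ['81', '15']

as_sets = df.itemsets.apply(set)   # each row's list becomes a set
mask = as_sets.le(set(rule))       # elementwise set comparison per row
print(mask.mean())                 # 0.2857142857142857
```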

Numpy

should speed things up

def support(rule, series):
    n = len(series)
    # row index for every individual item, repeated by list length
    i = np.arange(n).repeat(series.str.len())
    out = np.ones(n, bool)
    # AND into each row: True only if all of its items are in the rule
    np.logical_and.at(out, i, np.in1d(np.concatenate(series), rule))
    return out.mean()

support(rule, df.itemsets)

0.2857142857142857
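As a sanity check, the vectorized version can be compared against the pandas one-liner on the toy frame (a sketch assuming numpy and pandas imported as np and pd):

```python
import numpy as np
import pandas as pd

df = pd.Series([["26"], ["81", "15", "27"], ["50"], ["8"], ["81", "15"],
                ["10"], ["81"]]).to_frame(name='itemsets')
rule = ['81', '15']

def support(rule, series):
    n = len(series)
    i = np.arange(n).repeat(series.str.len())  # row index per item
    out = np.ones(n, bool)
    np.logical_and.at(out, i, np.in1d(np.concatenate(series), rule))
    return out.mean()

vectorized = support(rule, df.itemsets)
baseline = df.itemsets.apply(set).le(set(rule)).mean()
print(vectorized, baseline)  # both 0.2857142857142857
```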