I'm trying to implement the ECLAT algorithm for association rule mining / frequent pattern detection. The algorithm relies on being able to quickly and efficiently compute the intersections of all combinations of sets and return every intersection with len(intersection) >= min_support.
I have some sample data I'm testing against, which I generate with the following code:
import pandas as pd
import numpy as np
%load_ext Cython
%load_ext line_profiler
%load_ext memory_profiler
df = pd.DataFrame(
    [[{"bread"},  {1, 4, 5, 7, 8, 9}],
     [{"butter"}, {1, 2, 3, 4, 6, 8, 9}],
     [{"milk"},   {3, 5, 6, 7, 8, 9}],
     [{"coke"},   {2, 4}],
     [{"jam"},    {1, 8}]],
    columns=['food_item', 'TID'])
data_dict2 = df.to_dict(orient='index')
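For reference, to_dict(orient='index') keys the result by the DataFrame's integer index, so (up to set element ordering) data_dict2 ends up looking like this:

data_dict2
# {0: {'food_item': {'bread'},  'TID': {1, 4, 5, 7, 8, 9}},
#  1: {'food_item': {'butter'}, 'TID': {1, 2, 3, 4, 6, 8, 9}},
#  2: {'food_item': {'milk'},   'TID': {3, 5, 6, 7, 8, 9}},
#  3: {'food_item': {'coke'},   'TID': {2, 4}},
#  4: {'food_item': {'jam'},    'TID': {1, 8}}}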
I've implemented the algorithm like this:
from itertools import combinations, starmap
def eclat(data_dictionary, min_support, num_iterations):
    tid_list = []
    food_item_list = []
    frequent_item_list = []
    # split the dict-of-dicts into two parallel lists, so the k-th TID set
    # corresponds to the k-th food item
    for elem in data_dictionary:
        tid_list.append(data_dictionary[elem]['TID'])
        food_item_list.append(data_dictionary[elem]['food_item'])
    for i in range(1, num_iterations):
        # lazily intersect every i-sized combination of TID sets
        all_intersections = starmap(set.intersection, combinations(tid_list, i))
        # the matching i-sized combinations of food items, in the same order
        all_food_intersections = combinations(food_item_list, i)
        food_intersection_list = list(all_food_intersections)
        for index, intersection in enumerate(all_intersections):
            if len(intersection) >= min_support:
                #print(index, sorted(intersection))
                #print(food_intersection_list[index])
                info = {"food_item": food_intersection_list[index],
                        "TID": intersection}
                frequent_item_list.append(info)
    #return sorted(frequent_item_list, key=lambda k: len(k['food_item']), reverse=True)  # sorting unnecessary, doing it to show me the "largest" rules first
    return frequent_item_list
%lprun -f eclat eclat(data_dict2, 2, 5)
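In case the index lookup looks fragile: itertools.combinations yields in a fixed, deterministic order, so the k-th TID combination and the k-th food-item combination always line up. A quick sanity check on the toy data above (the pairs shown in the comments are just illustrative):

from itertools import combinations, starmap

tid_list = [v['TID'] for v in data_dict2.values()]
food_item_list = [v['food_item'] for v in data_dict2.values()]

# the two combination streams stay aligned, which is what
# food_intersection_list[index] relies on inside eclat()
for foods, inter in zip(combinations(food_item_list, 2),
                        starmap(set.intersection, combinations(tid_list, 2))):
    print(foods, sorted(inter))
# ({'bread'}, {'butter'}) [1, 4, 8, 9]
# ({'bread'}, {'milk'}) [5, 7, 8, 9]
# ...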
Here's some code that simulates data of arbitrary size in the format above, to make profiling my code easier. I've written everything in a Jupyter/IPython interactive notebook, so it makes use of the "magics" built into Jupyter notebooks.
random_num_set_list = []
for i in range(0, 100):  # generate 100 random sets
    random_array = np.random.randint(0, 1000, (100))  # generate 100 TIDs with values drawn from [0, 1000)
    random_set = set(random_array)  # a set representation makes things easier for my code AND makes sense, since we should never have duplicate TIDs
    # sets ALSO are neat because, after deduplication, they're of random length, which is desirable in a testing function
    random_num_set_list.append(random_set)

rand_item_list = list(pd.util.testing.rands_array(8, 100))  # generate a list of 100 8-character strings
set_list = list(map(lambda x: {x}, rand_item_list))
df1 = pd.DataFrame({'food_item': set_list, 'TID': random_num_set_list})
data_dict3 = df1.to_dict(orient='index')
%lprun -f eclat eclat(data_dict3, 7, 4)  # can also do %memit
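Since memory is the real concern, the memory_profiler extension loaded at the top also exposes the %memit magic, which reports peak memory and the increment for a single call:

%memit eclat(data_dict3, 7, 4)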
Here's an example of the line profiler output for eclat(data_dict3, 7, 4):

Even at these data sizes, execution is already eating 1.6 GB of RAM. That's far too much.
My question is: what can I do to optimize this algorithm?
I'm willing to use Numba or Cython if that helps (the TIDs are never negative, so telling Cython to store only sets of unsigned ints should cut my memory footprint considerably). Another idea I've had is to find a way to run the algorithm without first having to compute the index of every intersection.
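To make that second idea concrete, here's a minimal, unbenchmarked sketch (eclat_paired is just an illustrative name; it assumes the same data_dictionary layout produced by to_dict(orient='index') above). Each food item stays paired with its TID set, so no index lookup into a pre-materialized list is needed:

from itertools import combinations

def eclat_paired(data_dictionary, min_support, num_iterations):
    # keep each food item glued to its TID set so no positional index is needed
    items = [(v['food_item'], v['TID']) for v in data_dictionary.values()]
    frequent_item_list = []
    for i in range(1, num_iterations):
        for combo in combinations(items, i):
            foods, tids = zip(*combo)  # unzip into aligned tuples
            intersection = set.intersection(*tids)
            if len(intersection) >= min_support:
                frequent_item_list.append({'food_item': foods, 'TID': intersection})
    return frequent_item_list

It still enumerates every combination, so it only removes the materialized food_intersection_list and the index bookkeeping, not the combinatorial blow-up.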
Thanks in advance for any help!