This is the second post related to my homework attempt at writing the apriori algorithm in Python - see Python - load transaction data into a list of lists, count occurrence of each string, where I got help loading my data faster and counting how many times each item appears in my dataset. The next part of my code runs very slowly, and I'm not familiar enough with the best use of Python functions to speed these operations up, so you'll see that I lean heavily on for loops and if cases. Here is my code, along with a list and a dict that were built earlier in the code, copied here so it can be run as-is:
# a dict and a list that are built earlier
item_data_lol = [['A B C D E F'], ['A E F G H I J K'], ['A B D E F G H'], ['B C D F G H'], ['G H I K J'], ['G H I J'], ['B C D H J K'], ['B C D H K'], ['A C E G I K'], ['A B D F G H I'], ['A B C D E F G H I J K'], ['A B C D E'], ['C D F G'], ['C E F G H I'], ['C D E J K'], ['J K'], ['G H I J K'], ['A B D'], ['A C D K'], ['A B D I J K'], ['A B C E F G'], ['F G I J K'], ['A F G K'], ['B C E F G H'], ['A D E'], ['A B'], ['C D E F'], ['C E F G H I J'], ['I J K'], ['E F H I J K']]
first_lookup = collections.Counter(item for line in item_data_lol for item in line[0].split())
frequent_items = ['A', 'C', 'B', 'E', 'D', 'G', 'F', 'I', 'H', 'K', 'J']
Essentially, item_data_lol is a list of transactions, where the letters represent particular products being purchased. I'm trying to find pairs of products that are frequently bought together, and I only consider pairs whose products both belong to the frequent_items list. For example, the first transaction is A B C D E F, meaning these 6 products were all bought together. Here is what I have so far.
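As an aside, frequent_items does not have to be hard-coded: it can be derived from first_lookup by filtering on a minimum count. A sketch, where the threshold value and the truncated sample are illustrative and not from the assignment:

```python
import collections

# Truncated sample of the transaction data (illustrative)
item_data_lol = [['A B C D E F'], ['A E F G H I J K'], ['A B D E F G H']]

# Each item appears at most once per transaction, so this counts
# the number of transactions containing each item
first_lookup = collections.Counter(
    item for line in item_data_lol for item in line[0].split()
)

# Hypothetical support threshold: keep items seen at least twice
support_threshold = 2
frequent_items = sorted(
    item for item, count in first_lookup.items() if count >= support_threshold
)
```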
# initialize second dict to count frequency of pairs of items
second_lookup = {}
# loop over each pair in frequent_tuples, creating a key/value pair in the dict for them
n = len(frequent_items)
for i in range(n):
    for j in range(n):
        item_1 = frequent_items[i]
        item_2 = frequent_items[j]
        if item_1 < item_2:
            this_key = (item_1, item_2)
            second_lookup[this_key] = 0
# loop through each row of the data again, create all possible combinations of pairs
# check if each pair is a key in second_lookup, if so increment the value by 1
for line in item_data_lol:
    line = line[0]
    # nested for loop over the row, needed to create tuple pairs for all items
    for item_1 in line.split():
        for item_2 in line.split():
            # check that the items aren't the same, then create a sorted tuple
            if item_1 < item_2:
                test_key = (item_1, item_2)
                if test_key in second_lookup.keys():
                    second_lookup[test_key] += 1
# filter second_lookup down to only those tuples/pairs with > support_threshold count
frequent_pairs = []
for this_key, this_value in second_lookup.iteritems():
    if this_value > support_threshold:
        frequent_pairs.append(this_key)
My strategy is simple but slow. I first initialize the second_lookup dict, creating a key for every possible 2-product pair that can be formed from the frequent_items list. I then loop over my data (item_data_lol), and for each row/transaction I create every combination of two items (for the first row that would be (A,B), (A,C), (A,D), (A,E), (A,F), (B,C), (B,D), (B,E), (B,F), (C,D)...). I then check whether each of these pairs is a key in the second_lookup dict, and if it is, I increment that key's value by 1.
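The per-row pair generation described above can be expressed directly with itertools.combinations, which emits each unordered pair exactly once (provided the items within a row are sorted), instead of looping over all ordered pairs and discarding half of them. A sketch on a truncated sample, with the frequent_items restriction folded in as a set test:

```python
import collections
import itertools

# Truncated sample of the transaction data (illustrative)
item_data_lol = [['A B C D E F'], ['A B D'], ['A B']]
frequent_items = {'A', 'B', 'D', 'E', 'F'}  # as a set for O(1) membership tests

# Count every qualifying pair that actually occurs; no need to
# pre-build a dict of all candidate pairs
second_lookup = collections.Counter(
    pair
    for line in item_data_lol
    for pair in itertools.combinations(sorted(line[0].split()), 2)
    if pair[0] in frequent_items and pair[1] in frequent_items
)
```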
Ultimately, this process is very slow. It runs at a reasonable speed on my test data, but not on larger datasets. Any ideas are appreciated!
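One thing worth profiling in the loop above is the test `if test_key in second_lookup.keys()`: on Python 2, dict.keys() builds a fresh list of keys on every call, so that membership test is a linear scan, whereas testing against the dict itself is a hash lookup. A minimal sketch of the difference:

```python
second_lookup = {('A', 'B'): 0, ('A', 'C'): 0}
test_key = ('A', 'B')

# Slow on Python 2: second_lookup.keys() materializes a list of keys
# each time, and `in` then scans that list linearly.
# if test_key in second_lookup.keys(): ...

# Fast everywhere: `in` on the dict itself is a hash lookup.
if test_key in second_lookup:
    second_lookup[test_key] += 1
```

(On Python 3, dict.keys() returns a view whose membership test is also a hash lookup, so the two spellings perform the same there; the list-building cost is specific to Python 2.)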
EDIT - I underestimated how much of a speedup I would get from removing .keys() in my if test; calling dict.keys() builds a new list of keys every time. That seems to have solved the problem.

Answer (score: 0):
Here is a one-liner that does something similar:
from collections import Counter
import itertools as it
c = Counter(it.chain.from_iterable(map(lambda x: it.combinations(x, 2), map(str.split, it.chain.from_iterable(item_data_lol)))))
It's pretty ugly, though. To break it down:
words = it.chain.from_iterable(item_data_lol)
products_lists = map(str.split, words)
combinations = map(lambda x: it.combinations(x, 2), products_lists)
c = Counter(it.chain.from_iterable(combinations))
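Note that it.combinations emits pairs in the order the items appear within each row, so a row like 'G H I K J' produces ('K', 'J') rather than the sorted ('J', 'K') that the question uses as a key. If the pair keys are expected in sorted order, sorting each row first keeps them canonical. A sketch on a truncated sample:

```python
from collections import Counter
import itertools as it

# Truncated sample; note that 'G H I K J' is not in sorted order
item_data_lol = [['G H I K J'], ['J K']]

# Sorting each row before taking combinations guarantees that every
# pair comes out as (smaller, larger), matching sorted tuple keys
c = Counter(
    pair
    for line in item_data_lol
    for pair in it.combinations(sorted(line[0].split()), 2)
)
```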