I'm implementing a simple collaborative filtering (CF) scheme, specifically item-to-item CF. For example, I have a giant dictionary of N items, where each key is a product name and the value is the list of customers who bought it:
d = {
    item1: [customer1, customer3, customer7],
    item2: [customer3, customer5],
    ...
    itemN: [customerX...customerY]
}
I also have a small function that computes the customer similarity between any two items, e.g. item1 vs item2:
def littlefunction(...):
    # convert the customer lists to sets
    item1 = set(d['item1'])
    item2 = set(d['item2'])
    commonCustomer = item1.intersection(item2)
    totalCustomer = item1.union(item2)
    similarity = float(len(commonCustomer)) / len(totalCustomer)
    return similarity
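As a quick sanity check, here is that Jaccard computation as a self-contained snippet on a toy dictionary (the data below is made up for illustration):

```python
# Toy item-to-customer dictionary for illustration.
d = {
    'item1': ['customer1', 'customer3', 'customer7'],
    'item2': ['customer3', 'customer5'],
}

def jaccard(a, b):
    """Jaccard similarity between the customer sets of two items."""
    set_a, set_b = set(d[a]), set(d[b])
    common = set_a & set_b   # customers who bought both items
    total = set_a | set_b    # customers who bought either item
    return len(common) / len(total)

print(jaccard('item1', 'item2'))  # 1 common / 4 total = 0.25
```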
To get the top similar items for a given item, I have to scan all items, compute the similarity N times, and then sort, so for N items the complexity is O(N*N). Each item currently takes about 2 minutes (N ≈ 3 million), so generating the full N*N similarity table would take 100,000 hours. Is there a better algorithm than this brute-force approach? I only need the top few results for each item.
Answer 0 (score: 4)
Create an inverted index that contains:
customer1: [item1, item3, item8, ...]
customer2: [item7, item8, item74, ...]
Then, to find the items similar to a given item, you only have to examine the items bought by the customers who bought that item, rather than all N items. The time per item should drop from 2 minutes to well under 2 seconds.
The second index requires more memory, but you're not duplicating the data. If memory is a problem, you can store the index in a simple database and still be much faster than your current N^2 algorithm.
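Building the inverted index is a single pass over the original dictionary. A minimal sketch, with toy data and the name `customer_to_items` assumed for illustration:

```python
from collections import defaultdict

# Toy item-to-customer dictionary for illustration.
d = {
    'item1': ['customer1', 'customer3', 'customer7'],
    'item2': ['customer3', 'customer5'],
}

# Inverted index: customer -> list of items that customer bought.
customer_to_items = defaultdict(list)
for item, customers in d.items():
    for customer in customers:
        customer_to_items[customer].append(item)

print(customer_to_items['customer3'])  # ['item1', 'item2']
```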
You want to create an N*N matrix that shows the similarity between any two items. Using my technique, you do the following:
Create an N*N matrix, and initialize it to 0.
for each item
    Get the list of customers who bought the item
        (from your item-to-customer index).
    Create an empty dictionary of related items
    for each customer in that list
        for each item that the customer bought
            update the dictionary (add new item, or increase count)
        end for
    end for
    You now have a dictionary that contains the related items,
    and how many customers bought each one. You can update the matrix row
    for the current item from that dictionary.
end for
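A sketch of that loop in Python, using a Counter per item rather than a dense matrix row (the two toy indexes and all names below are illustrative, not from the question; the co-purchase counts can be converted to Jaccard scores afterwards):

```python
from collections import Counter

# Toy indexes; in practice these come from your data.
item_customers = {
    'item1': ['c1', 'c3'],
    'item2': ['c3', 'c5'],
    'item3': ['c1'],
}
customer_items = {
    'c1': ['item1', 'item3'],
    'c3': ['item1', 'item2'],
    'c5': ['item2'],
}

related = {}
for item, customers in item_customers.items():
    counts = Counter()  # related item -> number of shared customers
    for customer in customers:
        counts.update(customer_items[customer])
    del counts[item]    # an item is not related to itself
    related[item] = counts

print(related['item1'])  # Counter({'item3': 1, 'item2': 1})
```

Only items that share at least one customer ever appear in a Counter, so the work per item is proportional to its co-purchases, not to N.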
Answer 1 (score: 1)
This is basically the same as @Jim Mischel's answer (build an explicit inverted index), but it doesn't build an explicit NxN matrix, and it's a sketch of an implementation rather than the entire system.
For a large NUM_CUSTOMERS (say 1,000,000), it doesn't perform great (about 20 seconds for 3 similarity_for checks), but the data here is random. Presumably real customer purchase data would show more correlation and less randomness.
First, build some synthetic data for testing:
import random
import operator

NUM_CUSTOMERS = 10000
NUM_ITEMS = 1000
MIN_PER_CUST = 2
MAX_PER_CUST = 10
NUM_INTEREST = 10

customers = ["customer_{}".format(num) for num in range(1, NUM_CUSTOMERS + 1)]
items = ["item_{}".format(num) for num in range(1, NUM_ITEMS + 1)]

customer_items = {
    customer: random.sample(items, random.randint(MIN_PER_CUST, MAX_PER_CUST))
    for customer in customers}

item_customers = {}
for customer, this_items in customer_items.items():
    for item in this_items:
        item_customers.setdefault(item, [])
        item_customers[item].append(customer)
Then define a couple of functions:
def similarity(item1, item2):
    item1_set = set(item_customers[item1])
    item2_set = set(item_customers[item2])
    num_common = len(item1_set.intersection(item2_set))
    num_total = len(item1_set.union(item2_set))
    return float(num_common) / num_total

def similarity_for(item):
    # Only items that share at least one customer with `item` can be similar.
    to_check = {
        itm for customer in item_customers[item]
        for itm in customer_items[customer]}
    to_check.discard(item)
    rankings = {
        item2: similarity(item, item2)
        for item2 in to_check}
    return sorted(rankings.items(), key=operator.itemgetter(1), reverse=True)
Then, run it:
for index, item in enumerate(sorted(random.sample(items, 3))):
    rankings = similarity_for(item)
    print("check {}; {} (of {}):".format(index, item, len(rankings)))
    for item_other, ranking in rankings[:NUM_INTEREST]:
        print("  {}: {}".format(item_other, ranking))
getting results like:
% python /tmp/similarity.py
check 0; item_121 (of 309):
item_520: 0.0283018867925
item_361: 0.027027027027
item_536: 0.0265486725664
item_637: 0.0238095238095
item_515: 0.020202020202
item_750: 0.019801980198
item_960: 0.0192307692308
item_25: 0.0190476190476
item_548: 0.018691588785
item_841: 0.018691588785
check 1; item_714 (of 298):
item_162: 0.0285714285714
item_491: 0.0272727272727
item_617: 0.0265486725664
item_949: 0.0260869565217
item_788: 0.0192307692308
item_446: 0.0190476190476
item_558: 0.018691588785
item_606: 0.0181818181818
item_177: 0.0181818181818
item_577: 0.018018018018
check 2; item_991 (of 352):
item_758: 0.0298507462687
item_204: 0.025
item_85: 0.0247933884298
item_769: 0.024
item_501: 0.0232558139535
item_860: 0.0227272727273
item_408: 0.0225563909774
item_480: 0.0223880597015
item_73: 0.0220588235294
item_593: 0.021897810219