Given a list of purchase events (customer_id, item):
1-hammer
1-screwdriver
1-nails
2-hammer
2-nails
3-screws
3-screwdriver
4-nails
4-screws
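I am trying to build a data structure that tells me how many times an item was bought together with another item. Not bought at the same time, but bought at any point since I started saving data. The result would look like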
{
hammer : {screwdriver : 1, nails : 2},
screwdriver : {hammer : 1, screws : 1, nails : 1},
screws : {screwdriver : 1, nails : 1},
nails : {hammer : 2, screws : 1, screwdriver : 1}
}
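meaning that a hammer was bought with nails twice (customers 1 and 2) and with a screwdriver once (customer 1), screws were bought with a screwdriver once (customer 3), and so on...
My current approach is
users = dict where the user id is the key and the list of items purchased is the value
usersForItem = dict where the item id is the key and the list of users who purchased the item is the value
userlist = temporary list of users who have rated the current item
pseudo:
for each event(customer,item)(sorted by item):
    add user to users dict if not exists, and add the items
    add item to items dict if not exists, and add the user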
----------
# rows is assumed to be a list of (item, user) pairs sorted by item
users = {}
usersForItem = {}
items = []
userlist = []
last_item = None
for item, user in rows:
    # add the user to the users dict if they don't already exist.
    users[user] = users.get(user, [])
    # append the current item_id to the list of items rated by the current user
    users[user].append(item)
    if item != last_item:
        # we just started a new item which means we just finished processing an item
        # write the userlist for the last item to the usersForItem dictionary.
        if last_item is not None:
            usersForItem[last_item] = userlist
        userlist = [user]
        last_item = item
        items.append(item)
    else:
        userlist.append(user)
# flush the user list of the final item
if last_item is not None:
    usersForItem[last_item] = userlist
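So at this point I have two dicts: who bought what, and what was bought by whom. This is where it gets tricky. Now that usersForItem is populated, I loop through it, loop through each user who bought the item, and look at that user's other purchases. I admit this is not the most Pythonic way of doing things. I am trying to make sure I get the correct result (which I do) before making it Pythonic.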
relatedItems = {}
for key, listOfUsers in usersForItem.items():
    relatedItems[key] = {}
    related = []
    for ux in listOfUsers:
        for itemRead in users[ux]:
            if itemRead != key:
                if itemRead not in related:
                    related.append(itemRead)
                # count one co-purchase of key and itemRead for this user
                relatedItems[key][itemRead] = relatedItems[key].get(itemRead, 0) + 1
    # pseudo: calc jaccard/tanimoto similarity between relatedItems[key] and its values
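Is there a more efficient way I could be doing this? Also, if there is a proper academic name for this type of operation, I would love to hear it.
Edit: clarified to include the fact that I am not restricting purchases to items bought at the same time. Items can be purchased at any time.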
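For reference, the Jaccard/Tanimoto step mentioned in the pseudocode could be sketched as below. This is only a minimal illustration, assuming each item is represented by the set of customers who bought it; the name `users_for_item` and the helper `jaccard` are hypothetical and simply mirror the sample events above.

```python
def jaccard(buyers_a, buyers_b):
    """Jaccard (Tanimoto) similarity of two sets: |A & B| / |A | B|."""
    union = buyers_a | buyers_b
    if not union:
        return 0.0
    return len(buyers_a & buyers_b) / len(union)

# Hypothetical item -> set-of-customers index for the sample events.
users_for_item = {
    "hammer": {"1", "2"},
    "screwdriver": {"1", "3"},
    "nails": {"1", "2", "4"},
    "screws": {"3", "4"},
}

# hammer and nails share customers 1 and 2 out of {1, 2, 4}
hammer_nails = jaccard(users_for_item["hammer"], users_for_item["nails"])
# hammer and screws share no customers
hammer_screws = jaccard(users_for_item["hammer"], users_for_item["screws"])
```

The numerator of this ratio is exactly the co-purchase count the question asks for, so the counts and the similarity can be derived from the same index.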
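Answer 0 (score: 3)
Do you really need to precompute all possible pairs? What if you did it lazily, i.e. on demand?
This can be represented as a 2D matrix. Rows correspond to customers and columns correspond to products.
Each entry is either 0 or 1, indicating whether the product for that column was bought by the customer for that row.
If you look at each column as a vector of (around 5000) 0s and 1s, then the number of times two products were bought together is just the dot product of the corresponding vectors!
So you can compute those vectors first, and then compute the dot products lazily, on demand.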
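A minimal sketch of the matrix idea, using plain Python lists as the 0/1 column vectors. The names `columns`, `row` and `bought_together` are made up for illustration, and the events are the sample data from the question.

```python
events = [
    ("1", "hammer"), ("1", "screwdriver"), ("1", "nails"),
    ("2", "hammer"), ("2", "nails"),
    ("3", "screws"), ("3", "screwdriver"),
    ("4", "nails"), ("4", "screws"),
]

# Assign each customer a row index.
customers = sorted({cust for cust, _ in events})
row = {cust: i for i, cust in enumerate(customers)}

# One 0/1 column vector per product, with one entry per customer.
columns = {}
for cust, item in events:
    columns.setdefault(item, [0] * len(customers))[row[cust]] = 1

def bought_together(a, b):
    """Dot product of two product columns = number of shared customers."""
    return sum(x * y for x, y in zip(columns[a], columns[b]))
```

With this layout nothing is precomputed per pair; a pair count is only worked out when `bought_together` is called for it.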
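Computing the dot product:
Now, a good representation of a vector with only 0s and 1s is an array of integers, which is basically a bitmap.
For 5000 entries, you will need an array of 79 64-bit integers.
So given two such arrays, you need to count the number of 1s they have in common.
To count the number of bits common to two integers, first take the bitwise AND, and then count the number of 1 bits set in the result.
For this you can use either lookup tables or some bit-counting method (not sure if Python supports them), like the ones here: http://graphics.stanford.edu/~seander/bithacks.html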
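So your algorithm would be something like this:
Initialize an array of 79 64-bit integers for each product.
For each customer, look at the products bought and set the bit for that customer in each of the corresponding products.
Now, given a query of two products for which you need the number of customers who bought both, just take the dot product as described above.
This should be reasonably fast.
As a further optimization, you could consider grouping customers.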
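The steps above can be sketched in Python as follows. This is an illustration under one substitution: Python integers are arbitrary precision, so a single int per product stands in for the array of 79 64-bit words. The names `bitmap`, `customer_bit`, `popcount` and `bought_together` are hypothetical.

```python
from collections import defaultdict

events = [
    ("1", "hammer"), ("1", "screwdriver"), ("1", "nails"),
    ("2", "hammer"), ("2", "nails"),
    ("3", "screws"), ("3", "screwdriver"),
    ("4", "nails"), ("4", "screws"),
]

customer_bit = {}          # customer id -> bit position
bitmap = defaultdict(int)  # product -> int used as a bitmap of customers

for cust, item in events:
    # Give each new customer the next free bit position.
    bit = customer_bit.setdefault(cust, len(customer_bit))
    bitmap[item] |= 1 << bit

def popcount(n):
    """Count set bits; on Python 3.10+ n.bit_count() does the same."""
    return bin(n).count("1")

def bought_together(a, b):
    """Bitwise AND keeps the shared customers; popcount counts them."""
    return popcount(bitmap.get(a, 0) & bitmap.get(b, 0))
```

Unknown products are treated as an empty bitmap here, so a query for them returns 0 instead of raising.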
Answer 1 (score: 2)
events = """\
1-hammer
1-screwdriver
1-nails
2-hammer
2-nails
3-screws
3-screwdriver
4-nails
4-screws""".splitlines()
# split each event into a (key, item) tuple and sort so groupby sees each key once
events = sorted(tuple(map(str.strip, e.split('-'))) for e in events)
from collections import defaultdict
from itertools import groupby
# tally each occurrence of each pair of items
summary = defaultdict(int)
for val, items in groupby(events, key=lambda x: x[0]):
    items = sorted(it[1] for it in items)
    for i, item1 in enumerate(items):
        for item2 in items[i+1:]:
            summary[(item1, item2)] += 1
            summary[(item2, item1)] += 1
# now convert raw pair counts into friendlier lookup table
pairmap = defaultdict(dict)
for k, v in summary.items():
    item1, item2 = k
    pairmap[item1][item2] = v
# print the results
for k, v in sorted(pairmap.items()):
    print(k, ':', v)
Gives (the order of keys inside each dict may differ depending on the Python version):
hammer : {'nails': 2, 'screwdriver': 1}
nails : {'screws': 1, 'hammer': 2, 'screwdriver': 1}
screwdriver : {'screws': 1, 'nails': 1, 'hammer': 1}
screws : {'nails': 1, 'screwdriver': 1}
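(This addresses your initial request of grouping items by purchase event. To group by user, just change the first key of the events list from event number to user id.)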
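A more compact variation of the same tally, offered only as a sketch: it assumes repeat purchases by one customer count once (set semantics) and uses `collections.Counter` with `itertools.combinations` in place of the nested index loops. The names `items_by_customer` and `pair_counts` are made up here.

```python
from collections import Counter, defaultdict
from itertools import combinations

events = [
    ("1", "hammer"), ("1", "screwdriver"), ("1", "nails"),
    ("2", "hammer"), ("2", "nails"),
    ("3", "screws"), ("3", "screwdriver"),
    ("4", "nails"), ("4", "screws"),
]

# Collect the distinct items bought by each customer.
items_by_customer = defaultdict(set)
for cust, item in events:
    items_by_customer[cust].add(item)

# Count every unordered pair of items once per customer.
pair_counts = Counter()
for items in items_by_customer.values():
    for a, b in combinations(sorted(items), 2):
        pair_counts[(a, b)] += 1

# Expand into the symmetric item -> {other item: count} lookup table.
pairmap = defaultdict(dict)
for (a, b), n in pair_counts.items():
    pairmap[a][b] = n
    pairmap[b][a] = n
```

Storing each unordered pair once and mirroring it at the end halves the size of the intermediate counter compared with incrementing both orderings.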
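Answer 2 (score: 1)
Paul's answer might be the best, but here is what I came up with over my lunch break (untested, admittedly, but still a fun exercise in thinking). I am not sure about the speed or optimization of my algorithm. I would personally suggest looking at something like MongoDB, a NoSQL database, since it seems like it could lend itself nicely to this kind of problem (what with map/reduce and all).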
# assuming events is an iterable of (cust_id, item) pairs...
user = {}
for (cust_id, item) in events:
    if cust_id not in user:
        user[cust_id] = set()
    user[cust_id].add(item)
# now we have a dictionary of cust_ids keyed to a set of every item
# they've ever bought (given that repeats don't matter)
# now we construct a dict of items keyed to a dictionary of other items
# which are in turn keyed to num times present
items = {}
def insertOrIter(d, k, v):
    # add v to d[k], creating the entry if it does not exist yet
    if k in d:
        d[k] += v
    else:
        d[k] = v
for key in user:
    # keep track of items bought with each other
    itemsbyuser = []
    for item in user[key]:
        # make sure the item with dict is set up
        if item not in items:
            items[item] = {}
        # as we see each item, add to it others and others to it
        for other in itemsbyuser:
            insertOrIter(items[other], item, 1)
            insertOrIter(items[item], other, 1)
        itemsbyuser.append(item)
# now, unless i've screwed up my logic, we have a dictionary of items keyed
# to a dictionary of other items keyed to how many times they've been
# bought with the first item. *whew*
# If you want something more (potentially) useful, we just turn that around to be a
# dictionary of items keyed to a list of tuples of (times seen, other item) and
# you're good to go.
useful = {}
for i in items:
    temp = []
    for other in items[i]:
        temp.append((items[i][other], other))
    useful[i] = sorted(temp, reverse=True)
# Now you should have a dictionary of items keyed to lists of tuples of
# (number times bought with item, other item) sorted in descending order of
# number of times bought together
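Answer 3 (score: 1)
It is rather strange that all of the solutions churn through the whole database to get the counts every time you want statistics.
I would suggest keeping the data flat and indexed, and getting results only for a specific item, one item at a time. If your item count is large, this will be more efficient.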
from collections import defaultdict
from itertools import groupby
class myDB:
    '''Example of "indexed" "database" of orders <-> items on order'''
    def __init__(self):
        self.id_based_index = defaultdict(set)
        self.item_based_index = defaultdict(set)
    def add(self, order_data):
        for id, item in order_data:
            self.id_based_index[id].add(item)
            self.item_based_index[item].add(id)
    def get_compliments(self, item):
        all_items = []
        for id in self.item_based_index[item]:
            all_items.extend(self.id_based_index[id])
        gi = groupby(sorted(all_items), lambda x: x)
        return dict([(k, len(list(g))) for k, g in gi])
Example of using it:
events = """1-hammer
1-screwdriver
1-nails
2-hammer
2-nails
3-screws
3-screwdriver
4-nails
4-screws"""
db = myDB()
db.add(
    [ map(str.strip,e.split('-')) for e in events.splitlines() ]
)
# index is incrementally increased
db.add([['5','plunger'],['5','beer']])
# this scans and counts only needed items
assert db.get_compliments('NotToBeFound') == {}
assert db.get_compliments('hammer') == {'nails': 2, 'hammer': 2, 'screwdriver': 1}
# you get back the count for the requested product as well. Discard if not needed.
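This is all fun, but, seriously, just go for real database storage. Because indexing is already built into any DB engine, all of the above code in SQL would be just: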
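-- assumes order_product_map has order_id and product_id columns
select
    p_others.product_name,
    count(1) cnt
from products p
join order_product_map opm
    on p.product_id = opm.product_id
join order_product_map opm_others
    on opm.order_id = opm_others.order_id
join products p_others
    on opm_others.product_id = p_others.product_id
where p.product_name in ('hammer')
group by p_others.product_name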
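To try the SQL idea without setting up a server, here is a rough sketch using Python's built-in `sqlite3` module. The schema is an assumption: a `products` table plus an `order_product_map` table with `order_id` and `product_id` columns, and a self-join on `order_id` to reach the other products on the same order. The index names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table products (
        product_id integer primary key,
        product_name text unique
    );
    create table order_product_map (order_id text, product_id integer);
    create index opm_by_order on order_product_map (order_id);
    create index opm_by_product on order_product_map (product_id);
""")

events = [
    ("1", "hammer"), ("1", "screwdriver"), ("1", "nails"),
    ("2", "hammer"), ("2", "nails"),
    ("3", "screws"), ("3", "screwdriver"),
    ("4", "nails"), ("4", "screws"),
]
for order_id, name in events:
    # Register the product once, then link it to the order.
    conn.execute(
        "insert or ignore into products (product_name) values (?)", (name,)
    )
    conn.execute(
        "insert into order_product_map (order_id, product_id) "
        "select ?, product_id from products where product_name = ?",
        (order_id, name),
    )

query = """
    select p_others.product_name, count(1) as cnt
    from products p
    join order_product_map opm on p.product_id = opm.product_id
    join order_product_map opm_others on opm.order_id = opm_others.order_id
    join products p_others on opm_others.product_id = p_others.product_id
    where p.product_name = ?
    group by p_others.product_name
"""
result = dict(conn.execute(query, ("hammer",)).fetchall())
```

As with `get_compliments`, the requested product itself shows up in the result; add `and p_others.product_id <> p.product_id` to the `where` clause if that row is not wanted.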