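I am working with a Python product recommendation system (see the answer by Mohsin hasan).
The simple script takes two variables (UserId, ItemId) as input and outputs an affinity score between two products.
However, I have added a third column (Country). I would like to run the analysis separately for each country rather than on the whole data frame.
Originally I used R, where dplyr's 'group_by' function would help. In Python I am stuck for now (see my attempt below). Does anyone know how I can run this analysis per country? (I suspect 'pandas.DataFrame.groupby' could solve this as well, instead of a for loop.)
Sample data (note: the only difference is that I added the Country column):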
UserId ItemId Country
1 Babyphone Netherlands
1 Babyphone Netherlands
1 CoffeeMachine Netherlands
2 CoffeeMachine Netherlands
2 Shaver Netherlands
3 Shaver Netherlands
3 CoffeeMachine Netherlands
4 CoffeeMachine Netherlands
4 Shaver Netherlands
4 Blender Netherlands
5 Blender Netherlands
5 BabyPhone Netherlands
5 Shaver Netherlands
6 Shaver Netherlands
7 CoffeeMachine Netherlands
7 CoffeeMachine Netherlands
8 BabyPhone Netherlands
9 Blender Netherlands
9 Blender Netherlands
1 Babyphone Germany
1 Babyphone Germany
1 CoffeeMachine Germany
2 CoffeeMachine Germany
2 Shaver Germany
3 Shaver Germany
3 CoffeeMachine Germany
4 CoffeeMachine Germany
4 Shaver Germany
4 Blender Germany
5 Blender Germany
5 BabyPhone Germany
5 Shaver Germany
6 Shaver Germany
7 CoffeeMachine Germany
7 CoffeeMachine Germany
8 BabyPhone Germany
9 Blender Germany
9 Blender Germany
Working original code (uses UserId and ItemId, without Country):
import itertools
import pandas as pd

# main is our data.
# get unique items
items = set(main.productId)
n_users = len(set(main.userId))
# make a dictionary of item and users who bought that item
item_users = main.groupby('productId')['userId'].apply(set).to_dict()
# iterate over combinations of item1 and item2 and store scores
result = []
for item1, item2 in itertools.combinations(items, 2):
    # share of all users who bought both items
    score = len(item_users[item1] & item_users[item2]) / n_users
    result.append((item1, item2, score))
    result.append((item2, item1, score))  # store score for reverse order as well
# convert results to a dataframe
result = pd.DataFrame(result, columns=["item1", "item2", "score"])
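To make the `item_users` step concrete, here is a minimal sketch on a hypothetical three-row frame. The values are made up for illustration and the column names `userId` and `productId` are taken from the code above, not from the sample data.

```python
import pandas as pd

# Hypothetical toy frame, using the column names from the original code.
main = pd.DataFrame({
    "userId": [1, 1, 2],
    "productId": ["Shaver", "Blender", "Shaver"],
})

# Map each product to the set of users who bought it at least once.
item_users = main.groupby("productId")["userId"].apply(set).to_dict()

# The set intersection gives the users who bought both products.
both = item_users["Shaver"] & item_users["Blender"]
print(item_users)
print(both)
```

Because the buyers are stored as sets, repeat purchases by the same user count only once.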
My attempt (per country, but it does not work). What have I tried?
As you can see, unfortunately it does not work...
Netherlands = df.loc[df['Country'] == 'Netherlands']
Germany = df.loc[df['Country'] == 'Germany']
results = []
for dataset in (Netherlands, Germany):
    for index, row in dataset.iterrows():
        Country = row['Country'] # Need to bind the name of the df later to the results
        items = set(dataset.ItemId) #Get unique Items per country
        n_users = len(set(dataset.UserId) # Get unique number of users per country
        item_users = dataset.groupby('ItemId'['UserId'].apply(set).to_dict() # I tried to add country here, but without results.
        for item1, item2 in itertools.combinations(items, 2):
            print("item1", item1)
            print("item2", item2)
            score = len(item_users[item1] & item_users[item2]) / n_users
            item_tuples = [(item1, item2), (item2, item1)]
            result.append((item1, item2, score))
            result.append((item2, item1, score)) # store score for reverse order as well
result = pd.DataFrame(result, columns=["item1", "item2", "score"])
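Edit 1: expected output
Edit 2: How is the score calculated? The score is the share of customers who bought both products of a combination.
For example, in the data you can see that Shaver and CoffeeMachine = 0.333 (because, per country, 3 of the 9 people bought this combination). In the first code block the scoring works fine. However, I cannot get it to run per country (which is the key problem here).
Thanks a lot!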
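The 0.333 figure can be checked by hand. The sketch below uses only the Netherlands rows from the sample data, written out as plain Python tuples; the variable names are illustrative.

```python
# (UserId, ItemId) rows for the Netherlands, copied from the sample data.
purchases = [
    (1, "Babyphone"), (1, "Babyphone"), (1, "CoffeeMachine"),
    (2, "CoffeeMachine"), (2, "Shaver"),
    (3, "Shaver"), (3, "CoffeeMachine"),
    (4, "CoffeeMachine"), (4, "Shaver"), (4, "Blender"),
    (5, "Blender"), (5, "BabyPhone"), (5, "Shaver"),
    (6, "Shaver"),
    (7, "CoffeeMachine"), (7, "CoffeeMachine"),
    (8, "BabyPhone"),
    (9, "Blender"), (9, "Blender"),
]

shaver_users = {user for user, item in purchases if item == "Shaver"}
coffee_users = {user for user, item in purchases if item == "CoffeeMachine"}
both = shaver_users & coffee_users            # users who bought both items
n_users = len({user for user, _ in purchases})  # distinct users in the country

score = len(both) / n_users
print(sorted(both), n_users, round(score, 3))  # [2, 3, 4] 9 0.333
```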
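Answer 0: (score: 1)
Here you go
=^..^=
As you mentioned, groupby can be used. First move the scoring loop into a function that also carries the extra 'country' field, then apply that function to the grouped data frame, as follows: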
import pandas as pd
import itertools
Move the scoring into a function:
def get_score(item):
    # item is one (country, sub-frame) pair from iterating over the groupby
    country = item[0]
    df = item[1]
    # get unique items
    items = set(df.ItemId)
    n_users = len(set(df.UserId))
    # make a dictionary of item and users who bought that item
    item_users = df.groupby('ItemId')['UserId'].apply(set).to_dict()
    # iterate over combinations of item1 and item2 and store scores
    result = []
    for item1, item2 in itertools.combinations(items, 2):
        score = len(item_users[item1] & item_users[item2]) / n_users
        result.append((item1, item2, score, country))
        result.append((item2, item1, score, country))  # store score for reverse order as well
    # convert results to a dataframe
    result = pd.DataFrame(result, columns=["item1", "item2", "score", 'country'])
    return result
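Group the data by country, then iterate over the groups to get the scores:
# group by the column name as a string so that each group key is the plain
# country name (a one-element list yields 1-tuples as keys in newer pandas versions)
grouped_data = df.groupby('Country')
df_list = []
for item in grouped_data:
    df_list.append(get_score(item))
# concat frames
df = pd.concat(df_list)
# remove rows with 0 score
df = df[df['score'] > 0]
Output: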
item1 item2 score country
0 BabyPhone Blender 0.111111 Germany
1 Blender BabyPhone 0.111111 Germany
4 BabyPhone Shaver 0.111111 Germany
5 Shaver BabyPhone 0.111111 Germany
8 Blender CoffeeMachine 0.111111 Germany
9 CoffeeMachine Blender 0.111111 Germany
10 Blender Shaver 0.222222 Germany
11 Shaver Blender 0.222222 Germany
14 CoffeeMachine Shaver 0.333333 Germany
15 Shaver CoffeeMachine 0.333333 Germany
16 CoffeeMachine Babyphone 0.111111 Germany
17 Babyphone CoffeeMachine 0.111111 Germany
0 BabyPhone Blender 0.111111 Netherlands
1 Blender BabyPhone 0.111111 Netherlands
4 BabyPhone Shaver 0.111111 Netherlands
5 Shaver BabyPhone 0.111111 Netherlands
8 Blender CoffeeMachine 0.111111 Netherlands
9 CoffeeMachine Blender 0.111111 Netherlands
10 Blender Shaver 0.222222 Netherlands
11 Shaver Blender 0.222222 Netherlands
14 CoffeeMachine Shaver 0.333333 Netherlands
15 Shaver CoffeeMachine 0.333333 Netherlands
16 CoffeeMachine Babyphone 0.111111 Netherlands
17 Babyphone CoffeeMachine 0.111111 Netherlands
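If you would rather avoid the explicit loop over item combinations, one possible alternative is a self-merge. This is only a sketch under the assumption that the frame has the columns Country, UserId and ItemId as in the question; `pair_scores_by_country` is a hypothetical helper name and the small frame below is made-up illustration data, not the sample data. It is a different technique from the loop above, and it only returns pairs that at least one user bought together, so no filtering on score > 0 is needed.

```python
import pandas as pd

def pair_scores_by_country(df):
    """Co-purchase score per country via a self-merge (hypothetical helper)."""
    # Count each (Country, UserId, ItemId) once, however often it was bought.
    unique = df[["Country", "UserId", "ItemId"]].drop_duplicates()
    # Self-merge on country and user: all ordered item pairs bought by one user.
    pairs = unique.merge(unique, on=["Country", "UserId"], suffixes=("1", "2"))
    pairs = pairs[pairs["ItemId1"] != pairs["ItemId2"]]
    # Number of users per (country, item1, item2); rows are unique per user.
    n_both = (pairs.groupby(["Country", "ItemId1", "ItemId2"])
                   .size().rename("n_both").reset_index())
    # Distinct users per country is the denominator of the score.
    n_users = (unique.groupby("Country")["UserId"]
                     .nunique().rename("n_users").reset_index())
    out = n_both.merge(n_users, on="Country")
    out["score"] = out["n_both"] / out["n_users"]
    out = out.rename(columns={"ItemId1": "item1", "ItemId2": "item2",
                              "Country": "country"})
    return out[["item1", "item2", "score", "country"]]

# Made-up illustration data: only user 1 in the Netherlands bought both items.
df = pd.DataFrame({
    "UserId":  [1, 1, 2, 3, 1, 2],
    "ItemId":  ["Shaver", "Blender", "Shaver", "Blender", "Shaver", "Blender"],
    "Country": ["Netherlands"] * 4 + ["Germany"] * 2,
})
scores = pair_scores_by_country(df)
print(scores)
```

In this toy frame only the Netherlands produces rows, since no German user bought both items.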