How to run an analysis grouped by a column value instead of on the whole dataset

Asked: 2019-08-12 14:01:34

Tags: python pandas pandas-groupby

I am using a Python product recommendation system (see the answer by Mohsin hasan).

The simple script takes two variables (UserId, ItemId) as input and outputs an affinity score between two products.

However, I added a third column (Country). I want to run the analysis separately for each country (rather than on the whole data frame).

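I originally worked in R, where dplyr's 'group_by' function would help here. But right now I am stuck (see my attempt below). Does anyone know how I can run this analysis per country? (I have a feeling that 'pandas.DataFrame.groupby' could solve this as well, rather than a for loop.)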

Sample data (note: the only difference is that I added the Country column):

UserId      ItemId          Country

1           Babyphone       Netherlands
1           Babyphone       Netherlands
1           CoffeeMachine   Netherlands
2           CoffeeMachine   Netherlands
2           Shaver          Netherlands
3           Shaver          Netherlands
3           CoffeeMachine   Netherlands
4           CoffeeMachine   Netherlands
4           Shaver          Netherlands
4           Blender         Netherlands
5           Blender         Netherlands
5           BabyPhone       Netherlands
5           Shaver          Netherlands
6           Shaver          Netherlands
7           CoffeeMachine   Netherlands
7           CoffeeMachine   Netherlands
8           BabyPhone       Netherlands
9           Blender         Netherlands
9           Blender         Netherlands   
1           Babyphone       Germany
1           Babyphone       Germany
1           CoffeeMachine   Germany
2           CoffeeMachine   Germany
2           Shaver          Germany
3           Shaver          Germany
3           CoffeeMachine   Germany
4           CoffeeMachine   Germany
4           Shaver          Germany
4           Blender         Germany
5           Blender         Germany
5           BabyPhone       Germany
5           Shaver          Germany
6           Shaver          Germany
7           CoffeeMachine   Germany
7           CoffeeMachine   Germany
8           BabyPhone       Germany
9           Blender         Germany
9           Blender         Germany

Working original code (uses UserId and ItemId, without Country):

import itertools
import pandas as pd

# main is our data.

# get unique items
items = set(main.productId)

n_users = len(set(main.userId))

# make a dictionary of item and users who bought that item
item_users = main.groupby('productId')['userId'].apply(set).to_dict()

# iterate over combinations of item1 and item2 and store scores
result = []
for item1, item2 in itertools.combinations(items, 2):
    # share of all users who bought both items
    score = len(item_users[item1] & item_users[item2]) / n_users
    result.append((item1, item2, score))
    result.append((item2, item1, score))  # store score for reverse order as well

# convert results to a dataframe
result = pd.DataFrame(result, columns=["item1", "item2", "score"])

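My attempt (meant to work per country, but it does not work). What have I tried?

  1. Filter the data frame for each country (yes, this is bad because it is not dynamic).
  2. Loop over the data frames (one data frame per country).
  3. Try to plug in the solution (see above) and apply it to each data frame separately.
  4. As you can see, unfortunately it does not work...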

           Netherlands = df.loc[df['Country'] == 'Netherlands']
           Germany     = df.loc[df['Country'] == 'Germany']
           results = []
           for dataset in (Netherlands, Germany):
               for index, row in dataset.iterrows():
               Country = row['Country'] # Need to bind the name of the df later to the results 
    
               items = set(dataset.ItemId) #Get unique Items per country
               n_users = len(set(dataset.UserId) # Get unique number of users per country 
               item_users = dataset.groupby('ItemId'['UserId'].apply(set).to_dict() # I tried to add country here, but without results. 
    
               for item1, item2 in itertools.combinations(items, 2):
                    print("item1", item1)
                    print("item2", item2)
                    score = len(item_users[item1] & item_users[item2]) / n_users
                    item_tuples = [(item1, item2), (item2, item1)]
                    result.append((item1, item2, score))
                    result.append((item2, item1, score)) # store score for reverse order as well
                    result = pd.DataFrame(result, columns=["item1", "item2", "score"])
    

Edit 1: Expected output (posted as an image, not reproduced here)

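Edit 2: How is the score calculated? The score is the share of customers who bought a given combination of products together.

For example, in the data you can see that Shaver and CoffeeMachine = 0.333 (because, per country, 3 out of 9 people bought this combination). In the first piece of code the score works fine. However, I cannot get it to run per country (that is the key problem here).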
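The arithmetic behind that 0.333 can be checked with a minimal sketch. It assumes the Netherlands rows of the sample data are loaded into a DataFrame called `nl` (a name chosen here purely for illustration):

```python
import pandas as pd

# Netherlands rows of the sample data, as (UserId, ItemId) pairs
rows = [
    (1, "Babyphone"), (1, "Babyphone"), (1, "CoffeeMachine"),
    (2, "CoffeeMachine"), (2, "Shaver"),
    (3, "Shaver"), (3, "CoffeeMachine"),
    (4, "CoffeeMachine"), (4, "Shaver"), (4, "Blender"),
    (5, "Blender"), (5, "BabyPhone"), (5, "Shaver"),
    (6, "Shaver"),
    (7, "CoffeeMachine"), (7, "CoffeeMachine"),
    (8, "BabyPhone"),
    (9, "Blender"), (9, "Blender"),
]
nl = pd.DataFrame(rows, columns=["UserId", "ItemId"])

# set of users per item, so repeat purchases by one user count once
users_by_item = nl.groupby("ItemId")["UserId"].apply(set)

# users who bought both items, divided by all distinct users
both = users_by_item["Shaver"] & users_by_item["CoffeeMachine"]
n_users = nl["UserId"].nunique()
score = len(both) / n_users

print(len(both), n_users, round(score, 3))  # 3 9 0.333
```

Users 2, 3 and 4 are the ones who bought both products, which gives 3 / 9.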

Thanks a lot!

1 Answer:

Answer 0 (score: 1)

Here you go

= ^ .. ^ =

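As you mentioned, groupby is the way to go. First move the scoring loop into a function that also carries the extra 'Country' field, then use it on the grouped data frame, as follows: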

import pandas as pd
import itertools

Move the scoring into a function:

def get_score(item):
    country = item[0]
    df = item[1]

    # get unique items
    items = set(df.ItemId)
    n_users = len(set(df.UserId))

    # make a dictionary of item and users who bought that item
    item_users = df.groupby('ItemId')['UserId'].apply(set).to_dict()

    # iterate over combinations of item1 and item2 and store scores
    result = []
    for item1, item2 in itertools.combinations(items, 2):
        # share of this country's users who bought both items
        score = len(item_users[item1] & item_users[item2]) / n_users
        result.append((item1, item2, score, country))
        result.append((item2, item1, score, country))  # store score for reverse order as well

    # convert results to a dataframe
    result = pd.DataFrame(result, columns=["item1", "item2", "score", 'country'])
    return result

Group the data by country, then iterate over the groups to get the scores:

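# pass the column name itself (not a one-element list) so each group key is a plain string
grouped_data = df.groupby('Country')

df_list = []
for item in grouped_data:  # each item is a (country, sub-frame) pair
    df_list.append(get_score(item))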

# concat frames
df = pd.concat(df_list)
# remove rows with 0 score
df = df[df['score'] > 0]

Output:

            item1          item2     score      country
0       BabyPhone        Blender  0.111111      Germany
1         Blender      BabyPhone  0.111111      Germany
4       BabyPhone         Shaver  0.111111      Germany
5          Shaver      BabyPhone  0.111111      Germany
8         Blender  CoffeeMachine  0.111111      Germany
9   CoffeeMachine        Blender  0.111111      Germany
10        Blender         Shaver  0.222222      Germany
11         Shaver        Blender  0.222222      Germany
14  CoffeeMachine         Shaver  0.333333      Germany
15         Shaver  CoffeeMachine  0.333333      Germany
16  CoffeeMachine      Babyphone  0.111111      Germany
17      Babyphone  CoffeeMachine  0.111111      Germany
0       BabyPhone        Blender  0.111111  Netherlands
1         Blender      BabyPhone  0.111111  Netherlands
4       BabyPhone         Shaver  0.111111  Netherlands
5          Shaver      BabyPhone  0.111111  Netherlands
8         Blender  CoffeeMachine  0.111111  Netherlands
9   CoffeeMachine        Blender  0.111111  Netherlands
10        Blender         Shaver  0.222222  Netherlands
11         Shaver        Blender  0.222222  Netherlands
14  CoffeeMachine         Shaver  0.333333  Netherlands
15         Shaver  CoffeeMachine  0.333333  Netherlands
16  CoffeeMachine      Babyphone  0.111111  Netherlands
17      Babyphone  CoffeeMachine  0.111111  Netherlands
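
As a possible alternative to looping over item combinations, the same per-country score could be sketched with a self-merge. This is a different technique from the function above, not the answerer's method. The function name `pairwise_scores` and the tiny `demo` frame are made up for illustration; the column names follow the sample data in the question.

```python
import pandas as pd


def pairwise_scores(df):
    """Return the co-purchase score per (Country, item1, item2), only where score > 0."""
    # One row per (Country, UserId, ItemId): repeat purchases count once.
    unique = df[["Country", "UserId", "ItemId"]].drop_duplicates()

    # Self-join on country and user: every pair of items the same user bought.
    pairs = unique.merge(unique, on=["Country", "UserId"], suffixes=("_1", "_2"))
    pairs = pairs[pairs["ItemId_1"] != pairs["ItemId_2"]]

    # Number of users per country who bought both items of a pair.
    counts = (
        pairs.groupby(["Country", "ItemId_1", "ItemId_2"])
        .size()
        .reset_index(name="n_both")
    )

    # Divide by the number of distinct users in that country.
    n_users = df.groupby("Country")["UserId"].nunique()
    counts["score"] = counts["n_both"] / counts["Country"].map(n_users)
    return counts.rename(columns={"ItemId_1": "item1", "ItemId_2": "item2"})


# Tiny hypothetical input in the same three-column layout as the sample data.
demo = pd.DataFrame(
    {
        "UserId": [1, 1, 2, 2, 3, 1, 2],
        "ItemId": ["Shaver", "Blender", "Shaver", "Blender", "Shaver", "Shaver", "Blender"],
        "Country": ["Netherlands"] * 5 + ["Germany"] * 2,
    }
)
scores = pairwise_scores(demo)
print(scores)
```

In `demo`, two of the three Dutch users bought both items, so both orderings of that pair score 2/3. No German user bought both, so Germany yields no rows, which matches the effect of dropping zero-score rows above.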