Question

I have an input dataset (in CSV format) with 100246 rows and 7 columns. It is movie rating data taken from http://grouplens.org/datasets/movielens/. The head of my dataframe is:

In [5]: df.head()
Out[5]: 
   movieId                                       genres  userId      rating  \
0        1  Adventure|Animation|Children|Comedy|Fantasy       1       5   
1        1  Adventure|Animation|Children|Comedy|Fantasy       2       3   
2        1  Adventure|Animation|Children|Comedy|Fantasy       5       4   
3        1  Adventure|Animation|Children|Comedy|Fantasy       6       4   
4        1  Adventure|Animation|Children|Comedy|Fantasy       8       3   

 imdbId       title  relDate  
0  114709  Toy Story      1995  
1  114709  Toy Story      1995  
2  114709  Toy Story      1995  
3  114709  Toy Story      1995  
4  114709  Toy Story      1995

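Using this dataset, I am computing a similarity score between every pair of movies, based on the Euclidean distance between user ratings (that is, if a sample of users rates two movies similarly, the movies are highly correlated). At the moment this is done by iterating over all movie pairs and using an if statement to find only those pairs that contain the movie currently of interest: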

  from itertools import combinations

  for i, item in enumerate(df['movieId'].unique()):
      for j, item_comb in enumerate(combinations(df['movieId'].unique(), 2)):
          if item in item_comb:
              # calculate the similarity score between item i and the other item in item_comb
              pass

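However, given that there are 8927 distinct movies in the dataset, the number of pairs is about 40M. This is a major bottleneck. So my question is: in what ways can I speed up my code?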

Answer 1

This link (collaborative-filtering scalability) describes using MongoDB for collaborative filtering on very large datasets.

Spark (collaborative-filter with Apache Spark) may also be suitable.

Answer 2

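There are ways to turn the iterative similarity computation into a matrix multiplication. If you use cosine similarity, the transformation is explained in more detail in the answers to this stack exchange question.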
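As a rough illustration of the matrix-multiplication form, here is a minimal NumPy sketch. It assumes an items x users rating matrix in which 0 stands for "not rated"; the function name `cosine_similarity_matrix` and the toy ratings are made up for this example and are not from the linked answer.

```python
import numpy as np

def cosine_similarity_matrix(item_user):
    """Cosine similarity between all rows of an items x users matrix."""
    item_user = np.asarray(item_user, dtype=float)
    # Length of each item's rating vector
    norms = np.linalg.norm(item_user, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for items with no ratings
    unit = item_user / norms
    # One matrix product yields every pairwise dot product at once
    return unit @ unit.T

# Toy data: 3 items (rows) rated by 4 users (columns); 0 means "not rated"
ratings = np.array([
    [5, 3, 0, 4],
    [4, 0, 0, 5],
    [0, 2, 5, 0],
])
sim = cosine_similarity_matrix(ratings)
print(np.round(sim, 3))
```

The single product `unit @ unit.T` replaces the explicit loop over pairs, so the heavy work is done inside NumPy's linear algebra routines rather than in Python.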

Another approach is to use the pairwise similarity metrics in the scikit-learn package, which includes an implementation of cosine similarity.

from sklearn.metrics.pairwise import cosine_similarity
user_ratings_df = ....            # create the user x item dataframe

# Note the dataframe is transposed to convert to items as rows 
item_similarity = cosine_similarity(user_ratings_df.T)
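One possible way to build the user x item dataframe is with `pivot_table`. The sketch below is only an assumption about what that step might look like: the toy `df` stands in for the question's data, and filling unrated cells with 0 is a modelling choice, not something the answer specifies.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-in for the question's df (values are made up for illustration)
df = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3, 3],
    "movieId": [1, 2, 1, 3, 2, 3],
    "rating":  [5, 3, 4, 2, 4, 5],
})

# Users as rows, movies as columns; unrated cells become 0 (an assumption)
user_ratings_df = df.pivot_table(
    index="userId", columns="movieId", values="rating"
).fillna(0)

# Transpose so that items are rows, then label the result with movie ids
item_similarity = pd.DataFrame(
    cosine_similarity(user_ratings_df.T),
    index=user_ratings_df.columns,
    columns=user_ratings_df.columns,
)
print(item_similarity.round(3))
```

Wrapping the result in a DataFrame keeps the movie ids as labels, so a pair can be looked up with `item_similarity.loc[1, 2]`.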

Answer 3

Vectorizing is better than looping.

Two pandas functions may be useful here: pivot_table() and corr()

For example:

In [5]: pt = df.pivot_table(columns=['movieId'], index=['userId'], values='rating')
Out[5]: 
   movieId       1    2    3    4    5
   userId                           
         1       5   ...
         2       3   ...
         5       4   ...
         6       4   ...
         8       3   ...

In [6]: pt.corr()
Out[6]: 
   movieId       1    2    3    4    5
   movieId                           
         1       1.0     ...
         2       0.XXX   ...
         3       0.XXX   ...
         4       0.XXX   ...
         5       0.XXX   ...

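Note that corr() here computes the standard correlation coefficient (Pearson correlation) between movies rather than the Euclidean distance. You can also use the min_periods parameter to set the minimum number of observations required per pair of columns to get a valid result.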
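If the Euclidean distance from the question is what is wanted, it can also be vectorized on the same pivot table. The following is a sketch under stated assumptions, not part of the original answer: the toy `df` is made up, the distance is restricted to users who rated both movies, and the mapping `1 / (1 + distance)` is just one common way to turn a distance into a similarity score.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the question's df (values are made up for illustration)
df = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 3, 3],
    "movieId": [1, 2, 3, 1, 2, 2, 3],
    "rating":  [5, 3, 4, 4, 1, 4, 5],
})
pt = df.pivot_table(columns=["movieId"], index=["userId"], values="rating")

# Pearson correlation; pairs with fewer than 2 co-rating users become NaN
pearson = pt.corr(min_periods=2)

# Euclidean distance over co-rating users only, using three matrix products:
# sum_u M[u,a] * M[u,b] * (R[u,a] - R[u,b])**2
M = pt.notna().to_numpy(dtype=float)   # 1 where a rating exists, else 0
R = pt.fillna(0).to_numpy(dtype=float)
R2 = R ** 2
sq = R2.T @ M + M.T @ R2 - 2 * (R.T @ R)
sq = np.clip(sq, 0, None)              # guard against tiny negative rounding errors
dist = pd.DataFrame(np.sqrt(sq), index=pt.columns, columns=pt.columns)

# Pairs with no common user have no meaningful distance: mark them NaN
co_counts = M.T @ M
dist = dist.where(co_counts > 0)

similarity = 1 / (1 + dist)            # assumed distance-to-similarity mapping
print(dist.round(3))
```

In this toy data, movies 1 and 2 share users 1 and 2, so their squared distance is (5-3)^2 + (4-1)^2 = 13.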

Answer 4

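This paper describes how you can speed up the algorithm with another approach.

The Amazon paper (2003) describes it properly.

In summary, the most important idea behind this algorithm is to compute the dot product of two vectors in a different way, rather than simply iterating over every element of each vector. This way, the algorithm only computes a similarity for two items when they have a customer in common. In other words, it skips the computations that would yield a similarity of 0.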

   For each item in product catalog, I1
      For each customer C who purchased I1
        For each item I2 purchased by customer C
          Record that a customer purchased I1 and I2
      For each item I2
        Compute the similarity between I1 and I2
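The pseudocode above could be sketched in Python roughly as follows. This is an illustration rather than the paper's implementation: the function name `item_to_item_similarity`, the (customer, item) input format, and the choice of cosine similarity over binary purchase vectors are all assumptions made for this example.

```python
from collections import defaultdict
from math import sqrt

def item_to_item_similarity(purchases):
    """purchases: iterable of (customer, item) pairs.
    Returns {(item1, item2): cosine similarity} for co-purchased items only."""
    items_by_customer = defaultdict(set)
    customers_by_item = defaultdict(set)
    for customer, item in purchases:
        items_by_customer[customer].add(item)
        customers_by_item[item].add(customer)

    similarity = {}
    for i1, customers in customers_by_item.items():
        # Count, for each other item, how many customers bought it with i1
        co_counts = defaultdict(int)
        for c in customers:
            for i2 in items_by_customer[c]:
                if i2 != i1:
                    co_counts[i2] += 1
        # Only items sharing at least one customer with i1 are ever visited
        for i2, n in co_counts.items():
            denom = sqrt(len(customers) * len(customers_by_item[i2]))
            similarity[(i1, i2)] = n / denom
    return similarity

purchases = [
    ("a", "x"), ("a", "y"),
    ("b", "x"), ("b", "y"), ("b", "z"),
    ("c", "z"),
    ("d", "w"),
]
sim = item_to_item_similarity(purchases)
print(sim[("x", "y")], sim[("x", "z")], ("x", "w") in sim)
```

Item "w" shares no customer with any other item, so no pair involving it is ever computed or stored, which is the saving the answer describes.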

Bottleneck in an item-based collaborative filter using a pandas dataframe and nested Python for loops
