Generating a test set for a recommendation engine

Time: 2016-04-15 15:03:07

Tags: machine-learning recommendation-engine collaborative-filtering

I am working on a recommendation engine based on implicit feedback. I have been following this post: http://insightdatascience.com/blog/explicit_matrix_factorization.html#movielens

It uses ALS (alternating least squares) to compute the user and item vectors. Since my data set cannot be partitioned by time, I am randomly taking 'x' ratings from each user and putting them into the test set. This is a reproducible example of my training user-item matrix.


  col1   col2   col3   col4   col5   col6   col7   col8   col9   col10  col11  col12  col13
+--------------------------------------------------------------------------------------------+
| 1      0      0      3      10     0      0      3      0      0      1      0      0      |
| 0      0      0      5      0      0      1      8      0      0      1      0      0      |
| 0      0      0      6      7      1      0      2      0      0      1      0      0      |
+--------------------------------------------------------------------------------------------+
I then create a test set using this piece of code:

    # Hold out one randomly chosen nonzero entry for this user
    test_ratings = np.random.choice(counts[user, :].nonzero()[0], size=1, replace=True)
    train[user, test_ratings] = 0
    test[user, test_ratings] = counts[user, test_ratings]
    # Sanity check: no entry appears in both train and test
    assert np.all((train * test) == 0)

Which gives me:

  col1   col2   col3   col4   col5   col6   col7   col8   col9   col10  col11  col12  col13
+--------------------------------------------------------------------------------------------+
| 0      0      0      0      0      0      0      3      0      0      0      0      0      |
| 0      0      0      0      0      0      1      0      0      0      0      0      0      |
| 0      0      0      6      0      0      0      0      0      0      0      0      0      |
+--------------------------------------------------------------------------------------------+

Here the rows are users and columns are items.
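
For reference, a self-contained version of this split might look like the sketch below. The function name `train_test_split`, the `n_holdout` and `seed` parameters, and the rule of skipping users who have too few interactions are assumptions added for illustration; they are not part of the snippet above.

```python
import numpy as np

def train_test_split(counts, n_holdout=1, seed=0):
    """Hold out n_holdout nonzero entries per user (row) for testing."""
    rng = np.random.default_rng(seed)
    train = counts.copy()
    test = np.zeros_like(counts)
    for user in range(counts.shape[0]):
        nonzero_items = counts[user].nonzero()[0]
        # Assumption: skip users who would be left with no training data.
        if len(nonzero_items) <= n_holdout:
            continue
        held_out = rng.choice(nonzero_items, size=n_holdout, replace=False)
        train[user, held_out] = 0
        test[user, held_out] = counts[user, held_out]
    # No entry may appear in both train and test.
    assert np.all(train * test == 0)
    return train, test

# The example matrix from above: rows are users, columns are items.
counts = np.array([
    [1, 0, 0, 3, 10, 0, 0, 3, 0, 0, 1, 0, 0],
    [0, 0, 0, 5, 0, 0, 1, 8, 0, 0, 1, 0, 0],
    [0, 0, 0, 6, 7, 1, 0, 2, 0, 0, 1, 0, 0],
])
train, test = train_test_split(counts)
```

Because every held-out value is moved rather than copied, `train + test` reproduces `counts`, and each row of `test` contains exactly one nonzero entry.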

Now, I was wondering whether this is a correct representation of my test set. I have picked one nonzero value per user and set everything else to zero, so my algorithm should rank that held-out item as the recommended item.

Is this the correct way of going about things?

Any help would be really appreciated.

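1 answer:

Answer 0 (score: 1)

Updated:

Yes, you should create a test set from some of the original counts and check whether your system identifies those user-item pairs as matches.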
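
One way to run that check is a top-k hit rate: for each user, see whether the held-out item appears among the k highest-scoring items not already in the training set. The sketch below is only an illustration; the function name `hit_rate_at_k`, the `scores` matrix (predicted preferences from whatever model you train), and the toy data are all assumptions.

```python
import numpy as np

def hit_rate_at_k(scores, train, test, k=3):
    """Fraction of users whose held-out item lands in their top-k list."""
    hits, evaluated = 0, 0
    for user in range(test.shape[0]):
        held_out = test[user].nonzero()[0]
        if len(held_out) == 0:
            continue  # nothing was held out for this user
        user_scores = scores[user].astype(float)  # astype returns a copy
        # Items already seen in training are not candidates for recommendation.
        user_scores[train[user] > 0] = -np.inf
        top_k = np.argsort(-user_scores)[:k]
        hits += int(np.isin(held_out, top_k).any())
        evaluated += 1
    return hits / evaluated if evaluated else 0.0

# Toy data: 2 users, 4 items, one held-out item per user.
train = np.array([[1, 0, 0, 2],
                  [0, 3, 0, 0]])
test = np.array([[0, 0, 5, 0],
                 [0, 0, 0, 4]])
scores = np.array([[0.9, 0.1, 0.8, 0.2],
                   [0.5, 0.9, 0.3, 0.1]])

print(hit_rate_at_k(scores, train, test, k=1))  # 0.5
```

With k=1 only the first user's held-out item is ranked first among unseen items, so the hit rate is 0.5; with k=3 both held-out items are recovered.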

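There are a few things you should be careful about:

  • Only hold out test-set values for items or users for which you have more data;
  • Hide the test-set values from the training data;
  • Train your model only on the user-item pairs for which you have data, not on the zeros. This is because your zeros are assumed to mean pairs for which you have no data, not real ratings.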
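
The first and third points can be sketched roughly as follows. The helper names `users_with_enough_data` and `masked_squared_error`, the `min_interactions` threshold, and the toy factor matrices are assumptions made for illustration, not a prescribed implementation.

```python
import numpy as np

def users_with_enough_data(counts, min_interactions):
    """Indices of users with at least min_interactions nonzero entries."""
    return np.flatnonzero((counts > 0).sum(axis=1) >= min_interactions)

def masked_squared_error(counts, user_factors, item_factors):
    """Sum of squared errors over observed (nonzero) user-item pairs only."""
    observed = counts > 0
    predictions = user_factors @ item_factors.T
    errors = (counts - predictions)[observed]
    return float(np.sum(errors ** 2))

# Toy data: 2 users, 3 items, rank-1 factors.
counts = np.array([[2, 0, 1],
                   [0, 3, 0]])
user_factors = np.array([[1.0], [1.0]])
item_factors = np.array([[2.0], [5.0], [1.0]])

eligible = users_with_enough_data(counts, min_interactions=2)
loss = masked_squared_error(counts, user_factors, item_factors)
```

Here only the first user has enough interactions to spare one for testing, and the loss ignores every zero entry, so the only contribution comes from the pair where the prediction (5) misses the observed count (3).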

Note: the paper Collaborative Filtering for Implicit Feedback Datasets should help you with these and other issues.