Question

I am building a news recommendation system and need a table of users and the news items they have read. My raw data looks like this:

001436800277225 [12,456,157]
009092130698762 [248]
010003000431538 [361,521,83]
010156461231357 [173,67,244]
010216216021063 [203,97]
010720006581483 [86]
011199797794333 [142,12,86,411,201]
011337201765123 [123,41]
011414545455156 [62,45,621,435]
011425002581540 [341,214,286]

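The first column is the userID and the second column is the newsID. The newsID column holds indices: for example, [12,456,157] in the first row means that this user has read news items 12, 456, and 157 (in the sparse vector, columns 12, 456, and 157 are 1 and every other column is 0). I want to convert this data into a sparse vector format that can be used as input to KMeans or the DBSCAN algorithm in sklearn. How can I do that?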

Answer 1

One option is to construct the sparse matrix explicitly. I often find it easier to build the matrix in COO matrix format and then convert it to CSR format.

from scipy.sparse import coo_matrix

input_data = [
    ("001436800277225", [12,456,157]),
    ("009092130698762", [248]),
    ("010003000431538", [361,521,83]),
    ("010156461231357", [173,67,244])    
]

NUMBER_MOVIES = 1000 # number of columns; must exceed the largest item (news) index in the data
NUMBER_USERS = len(input_data) # number of users in the model

# you'll probably want to have a way to lookup the index for a given user id.
user_row_map = {}
user_row_index = 0

# structures for coo format
I,J,data = [],[],[]
for user, movies in input_data:

    if user not in user_row_map:
        user_row_map[user] = user_row_index
        user_row_index+=1

    for movie in movies:
        I.append(user_row_map[user])
        J.append(movie)
        data.append(1)  # 1 marks that the user read this item

# create the matrix in COO format; then cast it to CSR which is much easier to use
feature_matrix = coo_matrix((data, (I,J)), shape=(NUMBER_USERS, NUMBER_MOVIES)).tocsr()
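
As a follow-up to the question's goal, here is a minimal sketch of passing a CSR matrix like feature_matrix to KMeans and DBSCAN. The toy matrix, n_clusters, eps and min_samples are all assumptions chosen for illustration, not values from the original data; both estimators accept SciPy sparse input.

```python
from scipy.sparse import csr_matrix
from sklearn.cluster import DBSCAN, KMeans

# Toy stand-in for feature_matrix: 4 users x 6 news items (assumed data).
# Users 0 and 1 read items 0 and 1; users 2 and 3 read items 4 and 5.
rows = [0, 0, 1, 1, 2, 2, 3, 3]
cols = [0, 1, 0, 1, 4, 5, 4, 5]
values = [1.0] * len(rows)
feature_matrix = csr_matrix((values, (rows, cols)), shape=(4, 6))

# KMeans accepts CSR input directly; n_clusters=2 is an arbitrary choice here.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans_labels = kmeans.fit_predict(feature_matrix)

# DBSCAN also accepts sparse input; eps and min_samples are illustrative only.
dbscan = DBSCAN(eps=0.5, min_samples=2)
dbscan_labels = dbscan.fit_predict(feature_matrix)

print(kmeans_labels)
print(dbscan_labels)
```

On real data the clustering parameters would need tuning, and the matrix never has to be converted to a dense array.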

Answer 2

Use MultiLabelBinarizer from sklearn.preprocessing:

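import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# df is assumed to be a DataFrame whose newsID column holds one list of IDs per user
mlb = MultiLabelBinarizer()

pd.DataFrame(mlb.fit_transform(df.newsID), columns=mlb.classes_)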

   12   41   45   62   67   83   86   97   123  142 ...   244  248  286  341  361  411  435  456  521  621
0    1    0    0    0    0    0    0    0    0    0 ...     0    0    0    0    0    0    0    1    0    0
1    0    0    0    0    0    0    0    0    0    0 ...     0    1    0    0    0    0    0    0    0    0
2    0    0    0    0    0    1    0    0    0    0 ...     0    0    0    0    1    0    0    0    1    0
3    0    0    0    0    1    0    0    0    0    0 ...     1    0    0    0    0    0    0    0    0    0
4    0    0    0    0    0    0    0    1    0    0 ...     0    0    0    0    0    0    0    0    0    0
5    0    0    0    0    0    0    1    0    0    0 ...     0    0    0    0    0    0    0    0    0    0
6    1    0    0    0    0    0    1    0    0    1 ...     0    0    0    0    0    1    0    0    0    0
7    0    1    0    0    0    0    0    0    1    0 ...     0    0    0    0    0    0    0    0    0    0
8    0    0    1    1    0    0    0    0    0    0 ...     0    0    0    0    0    0    1    0    0    1
9    0    0    0    0    0    0    0    0    0    0 ...     0    0    1    1    0    0    0    0    0    0
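
The table above is dense. Since the question asks for a sparse representation, here is a minimal sketch that parses raw lines in the question's format and passes sparse_output=True to MultiLabelBinarizer, which returns a SciPy sparse matrix instead. The names raw_lines, user_ids and news_lists are hypothetical, and only two of the question's rows are used.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Two raw lines in the question's format (a subset, for brevity).
raw_lines = [
    "001436800277225 [12,456,157]",
    "009092130698762 [248]",
]

user_ids, news_lists = [], []
for line in raw_lines:
    # Split into the user ID and the bracketed list of news IDs.
    user_id, news_part = line.split(maxsplit=1)
    user_ids.append(user_id)
    # Strip the brackets and parse the comma-separated news IDs.
    news_lists.append([int(n) for n in news_part.strip("[]").split(",")])

# sparse_output=True makes fit_transform return a sparse matrix.
mlb = MultiLabelBinarizer(sparse_output=True)
sparse_features = mlb.fit_transform(news_lists)

print(sparse_features.shape)
print(mlb.classes_)
```

Note that the columns here follow the order of mlb.classes_ (only news IDs that actually occur), not the raw news index as in Answer 1.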
