我正在做新闻推荐系统,我需要为用户和他们阅读的新闻建立一个表格。我的原始数据就像这样:
001436800277225 [12,456,157]
009092130698762 [248]
010003000431538 [361,521,83]
010156461231357 [173,67,244]
010216216021063 [203,97]
010720006581483 [86]
011199797794333 [142,12,86,411,201]
011337201765123 [123,41]
011414545455156 [62,45,621,435]
011425002581540 [341,214,286]
第一列是userID
,第二列是newsID
。newsID
是索引列,例如,转换后,第一行中的[12,456,157]
表示该用户已阅读第12条,第456条和第157条新闻(在稀疏向量中,第12列,第456列和第157列为1
,而其他列的值为0
)。我想将这些数据更改为稀疏矢量格式,可以用作Kmeans中的输入向量或sklearn
的DBscan算法。
我怎么能这样做?
答案 0 :(得分:1)
一种选择是明确地构造稀疏矩阵。我经常发现在COO matrix format中构建矩阵然后转换为CSR format更容易。
from scipy.sparse import coo_matrix
input_data = [
("001436800277225", [12,456,157]),
("009092130698762", [248]),
("010003000431538", [361,521,83]),
("010156461231357", [173,67,244])
]
NUMBER_MOVIES = 1000 # maximum index of the movies in the data
NUMBER_USERS = len(input_data) # number of users in the model
# you'll probably want to have a way to lookup the index for a given user id.
user_row_map = {}
user_row_index = 0
# structures for coo format
I,J,data = [],[],[]
for user, movies in input_data:
if user not in user_row_map:
user_row_map[user] = user_row_index
user_row_index+=1
for movie in movies:
I.append(user_row_map[user])
J.append(movie)
data.append(1) # number of times users watched the movie
# create the matrix in COO format; then cast it to CSR which is much easier to use
feature_matrix = coo_matrix((data, (I,J)), shape=(NUMBER_USERS, NUMBER_MOVIES)).tocsr()
答案 1 :(得分:1)
使用MultiLabelBinarizer
sklearn.preprocessing
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
pd.DataFrame(mlb.fit_transform(df.newsID), columns=mlb.classes_)
12 41 45 62 67 83 86 97 123 142 ... 244 248 286 341 361 411 435 456 521 621
0 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 0 0 0
2 0 0 0 0 0 1 0 0 0 0 ... 0 0 0 0 1 0 0 0 1 0
3 0 0 0 0 1 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
6 1 0 0 0 0 0 1 0 0 1 ... 0 0 0 0 0 1 0 0 0 0
7 0 1 0 0 0 0 0 0 1 0 ... 0 0 0 0 0 0 0 0 0 0
8 0 0 1 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 1 0 0 1
9 0 0 0 0 0 0 0 0 0 0 ... 0 0 1 1 0 0 0 0 0 0