I want to get a tf-idf representation of the MovieLens tag dataset. The tags are in the following format:
import pandas as pd
ratings = pd.read_csv('data/ratings.csv',sep=',')
movies = pd.read_csv('data/movies.csv',sep=',')
tags = pd.read_csv('data/tags.csv',sep=',')
print(tags)
userId movieId tag \
0 15 339 sandra 'boring' bullock
1 15 1955 dentist
2 15 7478 Cambodia
3 15 32892 Russian
4 15 34162 forgettable
5 15 35957 short
6 15 37729 dull story
7 15 45950 powerpoint
8 15 100365 activist
9 15 100365 documentary
10 15 100365 uganda
11 23 150 Ron Howard
...
The first version of my tf-idf code looks like this:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(use_idf=True, norm='l2')
X = vectorizer.fit_transform(tags['tag'])
print(X)
(0, 89) 0.603928505945
(0, 80) 0.52013528953
(0, 577) 0.603928505945
(1, 160) 1.0
(2, 94) 1.0
(3, 573) 1.0
(4, 255) 1.0
(5, 604) 1.0
...
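This looks fine, but it is not exactly the representation I want. There are two main problems:
It would be great if you could tell me how to solve the above; I think it should be fairly easy.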
Answer 0 (score: 0)
Input
userId movieId tag
15 339 sandra 'boring' bullock
15 1955 dentist
15 7478 Cambodia
15 32892 Russian
15 34162 forgettable
15 35957 short
15 37729 dull story
15 45950 powerpoint
15 100365 activist
15 100365 documentary
15 100365 uganda
23 150 Ron Howard
Code
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Consolidated dataset: join all tags of a movie into one document
tags = pd.read_csv('tfidf_input1.csv')
concatenated_tags = tags.groupby('movieId')['tag'].apply(' '.join).reset_index()
# print(concatenated_tags)

# TF-IDF vectorization: one row per movieId
vec = TfidfVectorizer()
X = vec.fit_transform(concatenated_tags['tag'])
# print(X)

# To see which column belongs to which term, convert to dense
# (NOT advised for large matrices; the output is a compressed sparse matrix to save memory)
X_dense = X.toarray()
print(vec.get_feature_names_out())  # scikit-learn >= 1.0; older versions use get_feature_names()
print(X_dense[0, :])  # output for the first movieId