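I am applying TF-IDF to text documents, and for each document I get back a sparse n-dimensional vector. The vectors have different lengths.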
texts = [[token for token in text if frequency[token] > 1] for text in texts]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lda = models.ldamodel.LdaModel(corpus, num_topics=100, id2word=dictionary)
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=100)
corpus_lsi = lsi[corpus_tfidf]
corpus_lda = lda[corpus]
print("TFIDF:")
print(corpus_tfidf[1])
print("__________________________________________")
print(corpus_tfidf[2])
Output:
TFIDF:
Vec1: [(19, 0.06602704727889631), (32, 0.360417819987515), (33, 0.3078487494326974), (34, 0.360417819987515), (35, 0.2458968255872351), (36, 0.23680107692707422), (37, 0.29225639811281434), (38, 0.31741275088103), (39, 0.28571949457481044), (40, 0.32872456368129543), (41, 0.3855741727557306)]
__________________________________________
Vec2: [(5, 0.05617283528623041), (6, 0.10499864499395724), (8, 0.11265354901199849), (16, 0.028248249837939252), (19, 0.03948130674177094), (29, 0.07013501129200184), (33, 0.18408018239985235), (42, 0.14904146984986072), (43, 0.20484144632880313), (44, 0.215514203535732), (45, 0.15836501876891904), (46, 0.08505477582234795), (47, 0.07138425858136686), (48, 0.127695955436003), (49, 0.18408018239985235), (50, 0.2305566099597365), (51, 0.20484144632880313), (52, 0.2305566099597365), (53, 0.2305566099597365), (54, 0.053099690797234665), (55, 0.2305566099597365), (56, 0.2305566099597365), (57, 0.2305566099597365), (58, 0.0881162347543671), (59, 0.20484144632880313), (60, 0.16408387627386525), (61, 0.08256873616398946), (62, 0.215514203535732), (63, 0.2305566099597365), (64, 0.16731192344738707), (65, 0.2305566099597365), (66, 0.2305566099597365), (67, 0.07320703902661252), (68, 0.17912628269786976), (69, 0.12332630621892736)]
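Components that are not listed in a vector are 0. For example, there is no (18, ...) entry in the vectors above, so component 18 is 0.
I want to apply K-means clustering to these vectors (Vec1 and Vec2).
Scikit-learn's K-means needs vectors of the same dimension, in matrix format. What should I do?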
Answer 0 (score: 1)
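After looking at the source code, it appears that gensim builds a sparse vector for each document by hand, and that vector is just a list of (term_id, weight) tuples. That explains the error: scikit-learn's KMeans accepts scipy sparse matrices, but it does not know how to interpret gensim's sparse vectors. You can convert each individual list to a scipy csr_matrix with the following (it would be better to convert all documents at once, but this is a quick fix).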
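from scipy.sparse import csr_matrix
vec = corpus_tfidf[1]  # one gensim sparse vector: [(term_id, weight), ...]
rows = [0] * len(vec)  # a single document, so every entry sits in row 0
cols = [tup[0] for tup in vec]
data = [tup[1] for tup in vec]
# Fix the width to the vocabulary size so every document gets the same dimension
sparse_vec = csr_matrix((data, (rows, cols)), shape=(1, len(dictionary)))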
You should be able to use this sparse_vec directly, but if it throws an error you can convert it to a dense numpy array with .toarray() or to a numpy matrix with .todense().
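For the "convert all documents at once" route mentioned above, here is a minimal sketch in plain Python. It packs the gensim-style lists into the three CSR arrays (data, indices, indptr). The helper name bow_to_csr_arrays and the toy documents are made up for illustration, and num_terms is assumed to be known (for example len(dictionary)).

```python
def bow_to_csr_arrays(docs, num_terms):
    """Pack gensim-style sparse vectors into CSR arrays (hypothetical helper).

    docs: iterable of [(term_id, weight), ...] lists, one per document.
    num_terms: vocabulary size, assumed known (e.g. len(dictionary)).
    """
    data, indices, indptr = [], [], [0]
    for doc in docs:
        for term_id, weight in sorted(doc):
            if term_id >= num_terms:
                raise ValueError("term id %d is outside the vocabulary" % term_id)
            indices.append(term_id)   # column index of this entry
            data.append(weight)       # value of this entry
        indptr.append(len(indices))   # row boundary after each document
    shape = (len(indptr) - 1, num_terms)
    return data, indices, indptr, shape


# Two toy documents over a 5-term vocabulary (weights are made up)
docs = [[(0, 0.5), (3, 0.5)], [(1, 1.0)]]
data, indices, indptr, shape = bow_to_csr_arrays(docs, num_terms=5)
```

With scipy available, csr_matrix((data, indices, indptr), shape=shape) should build the whole matrix in one call. Every row then has the same width, which is what KMeans needs.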
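Edit: It turns out that gensim provides some nice utility functions, including one that takes the streaming corpus format and returns a csc matrix. Here is a complete example of how your code could work (hooked up to sklearn's KMeans clustering algorithm):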
from gensim import corpora, models, matutils
from sklearn.cluster import KMeans
texts = [[token for token in text] for text in texts]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
print("TFIDF:")
# corpus2csc returns a terms x documents matrix; transpose to documents x terms
corpus_tfidf = matutils.corpus2csc(corpus_tfidf).transpose()
print(corpus_tfidf)
print("__________________________________________")
kmeans = KMeans(n_clusters=2)
print(kmeans.fit_predict(corpus_tfidf))
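You should compute and pass the other arguments that corpus2csc accepts, since that can save cycles depending on the size of your corpus. We transpose the matrix because gensim puts documents in columns and terms in rows. You can convert the scipy sparse matrix to a host of other types depending on your use case (beyond k-means clustering).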
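As a rough sketch of what those extra arguments could look like: corpus2csc accepts num_terms, num_docs and num_nnz, and all three can be counted in a single pass over the gensim-format corpus. The helper name corpus_stats and the toy corpus are made up for illustration.

```python
def corpus_stats(corpus):
    """Return (num_docs, num_terms, num_nnz) for a gensim-style corpus.

    Hypothetical helper: num_terms is estimated as the largest term id
    seen plus one, which undercounts if the highest ids never occur.
    """
    num_docs, num_nnz, max_id = 0, 0, -1
    for doc in corpus:
        num_docs += 1
        num_nnz += len(doc)               # stored (non-zero) entries
        for term_id, _ in doc:
            max_id = max(max_id, term_id)
    return num_docs, max_id + 1, num_nnz


# Toy corpus: three documents, the last one empty (weights are made up)
toy = [[(0, 0.5), (3, 0.5)], [(1, 1.0)], []]
num_docs, num_terms, num_nnz = corpus_stats(toy)
```

The counts have to come from the same gensim corpus that is handed to corpus2csc (tfidf[corpus] in the example above, before it is overwritten with the matrix), for instance matutils.corpus2csc(tfidf[corpus], num_terms=num_terms, num_docs=num_docs, num_nnz=num_nnz). Using len(dictionary) for num_terms is probably the safer choice when the dictionary is at hand.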