I am following the sklearn documentation and example on spectral co-clustering of documents and words (http://scikit-learn.org/stable/auto_examples/bicluster/plot_bicluster_newsgroups.html#sphx-glr-auto-examples-bicluster-plot-bicluster-newsgroups-py), and my clusters come out very unbalanced. This is the output I get (a rough sketch of the setup I am running follows the output):
Vectorizing...
Coclustering...
Done in 7.20s. V-measure: 0.4267
MiniBatchKMeans...
Done in 9.13s. V-measure: 0.4414
Best biclusters:
----------------
bicluster 0 : 8 documents, 6 words
categories : 100% talk.politics.mideast
words : angmar, cosmo, alfalfa, alphalpha, proline, benson
bicluster 1 : 4 documents, 9 words
categories : 100% comp.windows.x
words : elin, eeam, ges, energeanwendung, penzingerstr, gesmbh, energieanwendung, hochreiter, wien
bicluster 2 : 14 documents, 33 words
categories : 86% comp.windows.x, 14% talk.politics.mideast
words : rpicas, porto, wg2, se05, libxmu, waii, xmu, picas, inescn, ep130
bicluster 3 : 2809 documents, 4242 words
categories : 25% comp.windows.x, 21% comp.sys.ibm.pc.hardware, 20% comp.graphics
words : windows, scsi, motif, ide, graphics, pc, card, window, bmug, controller
bicluster 4 : 5166 documents, 5686 words
categories : 16% rec.motorcycles, 15% rec.autos, 14% sci.electronics
words : autos, motorcycles, bike, car, sale, engine, dod, bmw, engr, honda
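For reference, this is roughly the pipeline behind that output. It is a minimal sketch assuming my script mirrors the linked example: the category list is only a guess based on the bicluster summary above, and I use a plain TfidfVectorizer instead of the example's number-normalizing vectorizer subclass for brevity.

import numpy as np
from sklearn.cluster import MiniBatchKMeans, SpectralCoclustering
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import v_measure_score

# Category list guessed from the bicluster summary above.
categories = ['comp.graphics', 'comp.sys.ibm.pc.hardware', 'comp.windows.x',
              'rec.autos', 'rec.motorcycles', 'sci.electronics',
              'talk.politics.mideast']
newsgroups = fetch_20newsgroups(categories=categories)
y_true = newsgroups.target

# Plain tf-idf features (the example wraps TfidfVectorizer to normalize numbers).
vectorizer = TfidfVectorizer(stop_words='english', min_df=5)
X = vectorizer.fit_transform(newsgroups.data)

# Spectral co-clustering of the document-word matrix.
cocluster = SpectralCoclustering(n_clusters=len(categories),
                                 svd_method='arpack', random_state=0)
cocluster.fit(X)
print("Coclustering V-measure: %.4f"
      % v_measure_score(cocluster.row_labels_, y_true))

# Baseline: MiniBatchKMeans on the same tf-idf matrix.
kmeans = MiniBatchKMeans(n_clusters=len(categories), random_state=0)
y_kmeans = kmeans.fit_predict(X)
print("MiniBatchKMeans V-measure: %.4f" % v_measure_score(y_kmeans, y_true))

# Cluster balance: document counts per bicluster row label.
print(np.bincount(cocluster.row_labels_))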
I don't know why my results differ from the tutorial's, but is there a way to make the clusters more balanced?
I even coded my "own" spectral co-clustering, following this link, using the SVD from scipy.sparse.linalg and sklearn's KMeans, but I ran into the same problem...
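My hand-rolled version is roughly the sketch below, assuming it follows Dhillon's bipartite spectral graph partitioning (normalize the document-word matrix, take singular vectors with scipy.sparse.linalg.svds, then k-means the stacked document and word embeddings). The function and variable names are mine, not from the linked page.

import numpy as np
from scipy.sparse import csr_matrix, diags
from scipy.sparse.linalg import svds
from sklearn.cluster import KMeans


def manual_spectral_cocluster(X, n_clusters, random_state=0):
    """Co-cluster the rows (documents) and columns (words) of a sparse matrix X."""
    X = csr_matrix(X, dtype=np.float64)

    # Normalize: An = D1^{-1/2} * X * D2^{-1/2}, where D1/D2 hold row/column sums.
    row_sums = np.asarray(X.sum(axis=1)).ravel()
    col_sums = np.asarray(X.sum(axis=0)).ravel()
    row_sums[row_sums == 0] = 1.0      # guard against empty rows/columns
    col_sums[col_sums == 0] = 1.0
    r_inv_sqrt = 1.0 / np.sqrt(row_sums)
    c_inv_sqrt = 1.0 / np.sqrt(col_sums)
    An = diags(r_inv_sqrt) @ X @ diags(c_inv_sqrt)

    # Take the leading singular vectors; the very first pair is trivial.
    n_sv = 1 + int(np.ceil(np.log2(n_clusters)))
    U, s, Vt = svds(An, k=n_sv)
    order = np.argsort(-s)             # svds does not guarantee sorted order
    U, Vt = U[:, order][:, 1:], Vt[order, :][1:, :]

    # Stack the scaled document and word embeddings and k-means them jointly.
    Z = np.vstack([r_inv_sqrt[:, None] * U,
                   c_inv_sqrt[:, None] * Vt.T])
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=random_state).fit_predict(Z)

    doc_labels = labels[:X.shape[0]]
    word_labels = labels[X.shape[0]:]
    return doc_labels, word_labels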
Thanks!