Working with a SciPy sparse matrix produced by sklearn in Python

Asked: 2015-06-24 23:41:54

Tags: python numpy scipy scikit-learn

I wrote code that runs sklearn's KMeans algorithm on 60 documents:

Part 1: Building token_dict (probably not that important):

for hashtag in hashtags:
    ob = users_tweeting(hashtag, 20)
    tweets = ob[1]
    overall = ""
    for tweet in tweets:
        # strip leading and trailing whitespace
        tweet = tweet.strip()
        # processing here
        overall += " " + tweet

    # one lowercase, punctuation-free document per hashtag
    lower = overall.lower()
    nopunct = punctuation_marks.sub("", lower)
    token_dict[hashtag] = nopunct

Part 2: Vectorizing and clustering the documents

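# Imports are not shown in the post; these are the likely ones (the `pl` alias is assumed).
from time import time
import matplotlib.pyplot as pl
from mpl_toolkits.mplot3d import Axes3D
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# `tokenize` is a user-defined function that is not shown in the post
tfidf = TfidfVectorizer(tokenizer=tokenize, ngram_range=(1, 5))
tfs = tfidf.fit_transform(token_dict.values())
X = tfs

print("n_samples: %d, n_features: %d" % X.shape)

km = KMeans(n_clusters=3, init='k-means++', max_iter=100, n_init=10, tol=1e-8, verbose=True)

print("Clustering sparse data with %s" % km)
t0 = time()
km.fit(X)
print("done in %0.3fs" % (time() - t0))

labels = km.labels_
centroids = km.cluster_centers_
figure = pl.figure(1)
ax = Axes3D(figure)
# this is the call that raises the error shown in the traceback
ax.scatter(X[:, 0], X[:, 1], X[:, 2])
pl.show()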

X is a SciPy sparse matrix, and it looks like this:

(0, 4558)     0.076421768112
(0, 5427)     0.015537938012
(0, 12380)    0.00517931267068
(0, 12554)    0.00517931267068
(0, 522)      0.116643751329
(0, 14100)    0.0120665949651
(0, 6851)     0.0723995697903
(0, 13100)    0.144799139581
(0, 14642)    0.0241331899
...
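
To make the printout above concrete, here is a minimal sketch on a made-up 3x4 matrix (not the data from this question; `X_toy` and its values are purely illustrative). It shows that slicing one column out of a SciPy sparse matrix gives back another sparse, two-dimensional object rather than a plain 1-D array, and that `.toarray().ravel()` is one way to get a dense vector.

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse

# Toy 3x4 matrix standing in for the TF-IDF output (values are made up).
dense = np.array([
    [0.0, 0.5, 0.0, 0.1],
    [0.2, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.7, 0.3],
])
X_toy = csr_matrix(dense)

# Slicing a column keeps the result sparse and two-dimensional, shape (3, 1).
col_sparse = X_toy[:, 0]
print(issparse(col_sparse), col_sparse.shape)

# Converting to a dense array and flattening gives a plain 1-D vector, shape (3,).
col_dense = col_sparse.toarray().ravel()
print(col_dense.shape, col_dense)
```

Densifying a single column this way is cheap, whereas calling `.toarray()` on the whole matrix may be costly when there are many features.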

The error I get is:

Traceback (most recent call last):
  File "features.py", line 185, in <module>
    ob = keywords(['#happy', '#sad', '#feelingsick'])
  File "features.py", line 106, in keywords
    ax.scatter(X[:, 0], X[:, 1], X[:, 2])
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/mpl_toolkits/mplot3d/axes3d.py", line 2180, in scatter
    patches = Axes.scatter(self, xs, ys, s=s, c=c, *args, **kwargs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/axes.py", line 6337, in scatter
    self.add_collection(collection)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/axes.py", line 1481, in add_collection
    self.update_datalim(collection.get_datalim(self.transData))
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/collections.py", line 185, in get_datalim
    offsets = np.asanyarray(offsets, np.float_)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/core/numeric.py", line 512, in asanyarray
    return array(a, dtype, copy=False, order=order, subok=True)
ValueError: setting an array element with a sequence.

This is almost the same as the error mentioned here, but I'm not sure how to fix it. The goal is to plot the clusters (not just the centroids).
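
The error most likely comes from `X[:, 0]` being a sparse n-by-1 matrix, which matplotlib cannot convert into a float array. Separately, the first three TF-IDF columns are just three individual n-gram features, so plotting them would probably say little about the clusters. One possible approach, which is not part of the original post, is to project the sparse matrix down to three dimensions with `TruncatedSVD` (it accepts sparse input and returns a dense array) and colour the points by `km.labels_`. The sketch below makes several assumptions: `docs` is a hypothetical stand-in for `token_dict.values()`, `plot_clusters` is a helper name invented here, and the default tokenizer replaces the custom `tokenize`.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in documents; the question uses token_dict.values().
docs = [
    "happy sunny day with friends",
    "so happy and cheerful today",
    "feeling sad and lonely tonight",
    "sad rainy day and bad news",
    "feeling sick with a bad cold",
    "sick in bed with fever and cold",
]

X = TfidfVectorizer().fit_transform(docs)  # SciPy sparse matrix
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# TruncatedSVD works on sparse input and returns a dense (n_samples, 3) array.
coords = TruncatedSVD(n_components=3, random_state=0).fit_transform(X)


def plot_clusters(coords, labels):
    """Return a figure with a 3-D scatter of the points, coloured by cluster label."""
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers "3d" on older matplotlib)

    fig = plt.figure()
    ax = fig.add_subplot(111, projection="3d")
    ax.scatter(coords[:, 0], coords[:, 1], coords[:, 2], c=labels)
    return fig
```

Calling `plot_clusters(coords, km.labels_)` and then `matplotlib.pyplot.show()` should display the figure. With the data from the question, `X` would be `tfs`. The projection is only a visual aid: the clustering itself is still done in the full TF-IDF space.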

Thanks in advance!

0 answers:

No answers yet