I wrote code that clusters 60 documents using sklearn's KMeans algorithm:
Section 1: Building token_dict (probably not that important):
for hashtag in hashtags:
    ob = users_tweeting(hashtag, 20)
    tweets = ob[1]
    overall = ""
    for tweet in tweets:
        tweet = tweet.lstrip()
        tweet = tweet.rstrip()
        # processing here
        overall += " "
        overall += tweet
    lower = overall.lower()
    nopunct = punctuation_marks.sub("", lower)
    token_dict[hashtag] = nopunct
Section 2: Vectorizing and clustering the documents
tfidf = TfidfVectorizer(tokenizer=tokenize, ngram_range = (1, 5))
tfs = tfidf.fit_transform(token_dict.values())
X = tfs
print("n_samples: %d, n_features: %d" % X.shape)
km = KMeans(n_clusters=3, init='k-means++', max_iter=100, n_init=10, tol = 1e-8, verbose=True)
print("Clustering sparse data with %s" % km)
t0 = time()
km.fit(X)
print("done in %0.3fs" % (time() - t0))
labels = km.labels_
centroids = km.cluster_centers_
figure = pl.figure(1)
ax = Axes3D(figure)
ax.scatter(X[:, 0], X[:, 1], X[:, 2])
pl.show()
X is a SciPy sparse matrix, and it looks like this:
(0, 4558) 0.076421768112
(0, 5427) 0.015537938012
(0, 12380) 0.00517931267068
(0, 12554) 0.00517931267068
(0, 522) 0.116643751329
(0, 14100) 0.0120665949651
(0, 6851) 0.0723995697903
(0, 13100) 0.144799139581
(0, 14642) 0.0241331899
...
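To illustrate what I suspect is going on with the slicing: a column slice of a SciPy sparse matrix is still a two-dimensional sparse matrix rather than a flat array of numbers. The sketch below uses a made-up 3x4 matrix (`X_demo` and its values are placeholders, not my data) and shows one way to get a plain 1-D array out of a column.

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse

# Hypothetical stand-in for the TF-IDF matrix; the values are made up.
X_demo = csr_matrix(np.array([
    [0.1, 0.0, 0.3, 0.0],
    [0.0, 0.2, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.4],
]))

# Slicing a column keeps the result sparse and two-dimensional.
col = X_demo[:, 0]

# Converting to a dense array and flattening yields ordinary numbers.
col_dense = col.toarray().ravel()
```

Here `col` has shape `(3, 1)` and is still sparse, while `col_dense` has shape `(3,)`, which is the kind of input a scatter call expects.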
The error I get is:
Traceback (most recent call last):
  File "features.py", line 185, in <module>
    ob = keywords(['#happy', '#sad', '#feelingsick'])
  File "features.py", line 106, in keywords
    ax.scatter(X[:, 0], X[:, 1], X[:, 2])
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/mpl_toolkits/mplot3d/axes3d.py", line 2180, in scatter
    patches = Axes.scatter(self, xs, ys, s=s, c=c, *args, **kwargs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/axes.py", line 6337, in scatter
    self.add_collection(collection)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/axes.py", line 1481, in add_collection
    self.update_datalim(collection.get_datalim(self.transData))
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/collections.py", line 185, in get_datalim
    offsets = np.asanyarray(offsets, np.float_)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/core/numeric.py", line 512, in asanyarray
    return array(a, dtype, copy=False, order=order, subok=True)
ValueError: setting an array element with a sequence.
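This is almost the same as the error mentioned here, but I'm not sure how to fix it. The goal is to plot the clusters themselves (not just the centroids).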
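To make the goal concrete, here is a rough sketch of the kind of plot I am after. It is not something I have working on my data, and it rests on assumptions of mine: that projecting the sparse matrix down to three dense dimensions with `TruncatedSVD` is acceptable (the first three TF-IDF columns are just three arbitrary n-grams, so they are not meaningful axes), and that a random sparse matrix can stand in for my documents. The names `plot_clusters_3d` and `X_demo` are hypothetical.

```python
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers "3d" on older matplotlib)
from scipy.sparse import random as sparse_random
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD


def plot_clusters_3d(X, n_clusters=3, seed=0):
    """Cluster sparse X, project it to 3 dimensions, and draw a 3D scatter."""
    # Cluster in the original high-dimensional space.
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = km.fit_predict(X)

    # TruncatedSVD accepts sparse input and returns a dense (n_samples, 3) array.
    svd = TruncatedSVD(n_components=3, random_state=seed)
    coords = svd.fit_transform(X)

    fig = plt.figure()
    ax = fig.add_subplot(projection="3d")
    # Color each point by its cluster label.
    ax.scatter(coords[:, 0], coords[:, 1], coords[:, 2], c=labels)
    return fig, coords, labels


# Hypothetical stand-in for the 60-document TF-IDF matrix.
X_demo = sparse_random(60, 200, density=0.05, format="csr", random_state=0)
fig, coords, labels = plot_clusters_3d(X_demo)
# plt.show()  # uncomment to display the figure interactively
```

In my case `X_demo` would be replaced by `tfs`. Whether a 3-component projection separates the clusters visibly is something I have not verified.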
Thanks in advance!