I am trying to use k-means clustering to find the topics in a set of tweets. I wrote the following code:
from sklearn.cluster import KMeans
import json
import re
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from scipy.spatial.distance import euclidean
with open("tweets.json",'r', encoding='utf8') as f:
    data = json.loads(f.readline())
noise = '|'.join(['%','#','@','RT','&', r'(?:(?:\d+,?)+(?:\.?\d+)?)',r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+'])
# for item in data:
#     print(item['text'].strip('@'))
filtered_data = []
for tweet in data:
    filtered_data.append(re.sub(noise, "", tweet['text']))
X = TfidfVectorizer(stop_words='english').fit_transform(filtered_data)
disortions = []
K = range(1, 10)
kmeanmod = KMeans(n_clusters=2).fit(X)
for k in K:
    kmeanmod = KMeans(n_clusters=k).fit(X)
    kmeanmod.fit(X)
    a = np.array(X)
    print(a.shape)
    b = np.array(kmeanmod.cluster_centers_)
    disortions.append(sum(np.min(euclidean(np.array(a), b))))
However, when I run this code, I get the following error:
raise NotImplementedError('adding a nonzero scalar to a '
NotImplementedError: adding a nonzero scalar to a sparse matrix is not supported
I can't figure out what causes this error. And when I change
disortions.append(sum(np.min(euclidean(np.array(a), b))))
to
disortions.append(sum(np.min(cdist(np.array(a), b))))
I get this error:
raise ValueError('XA must be a 2-dimensional array.')
ValueError: XA must be a 2-dimensional array.
Does anyone know how to solve this?
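
For reference, here is a minimal sketch of one way the distortion loop might be written so that both errors go away. It is not a definitive fix, and it rests on a few assumptions: `sample_texts` is a hypothetical stand-in for the cleaned tweet texts (it is not the contents of tweets.json), the data is small enough to be converted to a dense array with `toarray()`, and `n_init` and `random_state` are set only to make the run repeatable. The likely cause of both errors is that `TfidfVectorizer` returns a SciPy sparse matrix, and `np.array(X)` does not turn it into a 2-D numeric array.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the cleaned tweet texts (not from tweets.json).
sample_texts = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "my dog chased the cat",
    "stock markets fell today",
    "investors sold shares as markets dropped",
    "the market rallied after earnings",
]

# fit_transform returns a SciPy sparse matrix, not a NumPy array.
X = TfidfVectorizer(stop_words="english").fit_transform(sample_texts)

# np.array(X) only wraps the sparse matrix in an object array, so cdist
# rejects it. toarray() gives a real 2-D dense array (small data only).
dense = X.toarray()

distortions = []
K = range(1, 4)
for k in K:
    # KMeans accepts the sparse matrix directly.
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Distance from every sample to every centroid: shape (n_samples, k).
    dists = cdist(dense, model.cluster_centers_, "euclidean")
    # Take each sample's distance to its nearest centroid, then average.
    distortions.append(np.min(dists, axis=1).sum() / dense.shape[0])

print(distortions)
```

Note that `euclidean` compares two 1-D vectors, so it is `cdist` that gives the full sample-to-centroid distance table, and `np.min` needs `axis=1` to pick the nearest centroid per sample. If the dense conversion is too large for memory, the fitted model's `inertia_` attribute (the sum of squared distances of samples to their closest centroid) may be used for the elbow plot instead, without computing any distances by hand.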