Question

I have a dataset in which every item carries a sparse set of labels. Here is what the data looks like.

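[["Snow", "Winter", "Freezing", "Fun", "Beanie", "Footwear", "Headgear", "Fur", "Playing in the snow", "Photography"], ["Tree", "Sky", "Daytime", "Urban area", "Branch", "Urban area", "Winter", "City", "City", "Street light"], ...]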

There are about 50 labels in total and 200K data items. I want to cluster this data, but I am running into trouble processing it.

I tried to cluster the data with four clustering algorithms (AgglomerativeClustering, SpectralClustering, MiniBatchKMeans, KMeans), but none of them work because of memory problems.

Here is my code.

from scipy.sparse import csr_matrix
from sklearn.cluster import KMeans
from sklearn.cluster import MiniBatchKMeans
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import SpectralClustering
import json

NUM_OF_CLUSTERS = 10

with open('./data/sample.json') as json_file:
    json_data = json.load(json_file)
indptr = [0]
indices = []
data = []
vocabulary = {}
for d in json_data:
    for term in d:
        index = vocabulary.setdefault(term, len(vocabulary))
        indices.append(index)
        data.append(1)
    indptr.append(len(indices))

X = csr_matrix((data, indices, indptr), dtype=int).toarray()

# None of these algorithms work properly. I think it's because of memory issues.
miniBatchKMeans = MiniBatchKMeans(n_clusters=NUM_OF_CLUSTERS, n_init=5, random_state=0).fit(X)
# agglomerative = AgglomerativeClustering(n_clusters=NUM_OF_CLUSTERS).fit(X)
# spectral = SpectralClustering(n_clusters=NUM_OF_CLUSTERS, assign_labels="discretize", random_state=0).fit(X)
#
# print(miniBatchKMeans.labels_)
# print(agglomerative.labels_)
# print(spectral.labels_)

# Save the MiniBatchKMeans labels (this needs the fit above to have succeeded).
with open('data.json', 'w') as outfile:
    json.dump(miniBatchKMeans.labels_.tolist(), outfile)

Is there a solution to my problem, or do you have any other suggestions?

Answer 1

What is the size of X?

By calling toarray(), you are converting the data to a dense format. This greatly increases the memory requirement.
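To make the memory point concrete, here is a minimal sketch that builds the label matrix in CSR form and compares its footprint with the dense array that toarray() would allocate. It is not the asker's exact pipeline: the helper names `build_csr`, `csr_nbytes`, and `dense_nbytes`, the synthetic labels, and the `uint8` data type are all assumptions made for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix


def build_csr(rows):
    """Build a binary CSR matrix from lists of label strings.

    Hypothetical helper: duplicate labels inside one row are merged.
    """
    indptr, indices, vocabulary = [0], [], {}
    for labels in rows:
        for term in dict.fromkeys(labels):  # drop duplicates, keep order
            indices.append(vocabulary.setdefault(term, len(vocabulary)))
        indptr.append(len(indices))
    data = np.ones(len(indices), dtype=np.uint8)  # 1 byte per stored entry
    X = csr_matrix((data, indices, indptr), shape=(len(rows), len(vocabulary)))
    return X, vocabulary


def csr_nbytes(X):
    """Bytes held by the three arrays that make up a CSR matrix."""
    return X.data.nbytes + X.indices.nbytes + X.indptr.nbytes


def dense_nbytes(X, dtype=np.int64):
    """Bytes a dense array of the same shape would need."""
    return X.shape[0] * X.shape[1] * np.dtype(dtype).itemsize


# Synthetic stand-in for the JSON file: 1000 rows, 3 labels each, 50 labels overall.
rows = [[f"label{(7 * i + j) % 50}" for j in range(3)] for i in range(1000)]
X, vocabulary = build_csr(rows)
print(X.shape, csr_nbytes(X), dense_nbytes(X))
```

Running the same two size functions on the real X answers the question above and shows how much the dense copy actually costs.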

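With 200K instances you cannot use spectral clustering or affinity propagation, because both need O(n²) memory. So either choose other algorithms or subsample your data. There is obviously also no point in running both k-means and minibatch k-means (which is an approximation of k-means). Use only one.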
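The O(n²) figure is easy to check with a little arithmetic. The sketch below assumes an n × n matrix of 8-byte floats and uses two hypothetical helpers, `pairwise_matrix_gib` and `subsample_indices`; the subsample size of 10,000 is an arbitrary choice for illustration, not a recommendation.

```python
import random


def pairwise_matrix_gib(n_samples, bytes_per_entry=8):
    """Size in GiB of a full n x n matrix (e.g. an affinity matrix)."""
    return n_samples ** 2 * bytes_per_entry / 2 ** 30


def subsample_indices(n_samples, k, seed=0):
    """Pick k distinct row indices uniformly at random, in sorted order."""
    return sorted(random.Random(seed).sample(range(n_samples), k))


full = pairwise_matrix_gib(200_000)    # roughly 298 GiB
reduced = pairwise_matrix_gib(10_000)  # under 1 GiB
idx = subsample_indices(200_000, 10_000)

print(f"full: {full:.0f} GiB, subsample: {reduced:.2f} GiB, rows kept: {len(idx)}")
```

The selected indices can then be used to slice the rows of the sparse matrix before fitting one of the quadratic-memory algorithms.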

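To handle sparse data efficiently, you may need to implement the algorithms yourself. k-means is designed for dense data, so it makes sense that the default implementation targets dense data. In fact, using the mean on sparse data is rather questionable, so I would not expect very good results from k-means on this data either.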
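As a point of comparison before writing a custom implementation: scikit-learn's MiniBatchKMeans does accept a SciPy CSR matrix as input, so one thing to try is simply dropping the toarray() call. The sketch below uses synthetic binary data, and its sizes and parameters (`n_samples`, `labels_per_row`, `batch_size`, `n_init`) are assumptions for illustration rather than tuned values.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
n_samples, n_labels, labels_per_row = 2000, 50, 5

# Random binary label matrix, built directly in sparse form.
row_idx = np.repeat(np.arange(n_samples), labels_per_row)
col_idx = rng.integers(0, n_labels, size=n_samples * labels_per_row)
X = csr_matrix(
    (np.ones(row_idx.size, dtype=np.float32), (row_idx, col_idx)),
    shape=(n_samples, n_labels),
)
X.data[:] = 1.0  # repeated (row, col) pairs were summed; reset to binary

# Fit on the sparse matrix; no dense copy of X is requested here.
model = MiniBatchKMeans(n_clusters=10, n_init=3, batch_size=256, random_state=0)
cluster_ids = model.fit_predict(X)

print(cluster_ids[:10], model.cluster_centers_.shape)
```

This only avoids the dense copy of the input. The concern about averaging sparse binary labels still applies, so the resulting clusters should be inspected before they are trusted.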

Handling memory errors (Python sklearn clustering)
