I am using affinity propagation to cluster my dataset, and the dataset is very large. I convert the documents to vectors with TF-IDF and feed them to affinity propagation. On smaller datasets affinity propagation works seamlessly, but on the large one it starts consuming a huge amount of RAM and my OS eventually kills the process. I did a lot of research and found a few StackOverflow answers suggesting things like setting the preference to np.median or np.mean of the similarities, but that did not solve my problem. I also applied principal component analysis and tried to shrink the matrix, and it still consumed a lot of RAM. Then I found leveraged affinity propagation, which handles large datasets by computing similarities only for randomly chosen points instead of the full N x N matrix, but I could not find it in sklearn or anywhere else in Python; I only found it in R. (link)
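For reference, this is the kind of thing those answers suggest, as far as I understand it (the toy documents are just for illustration). It tunes the number of clusters, but it still materialises the full N x N similarity matrix, which is exactly what runs out of memory on my corpus:

import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import euclidean_distances

docs = ["the cat sat on the mat", "a cat and a dog",
        "stocks fell sharply", "markets fell on friday"]
X = TfidfVectorizer().fit_transform(docs).toarray()

# sklearn's default affinity is the negative squared euclidean distance,
# so the "preference from the data" trick looks like this ...
S = -euclidean_distances(X, squared=True)
af = AffinityPropagation(damping=0.5, preference=np.median(S)).fit(X)
print(af.labels_)

# ... but building S is itself an N x N allocation, so it does not
# help with the memory problem.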
How can we use sklearn's affinity propagation in Python on a dataset this large?
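To make concrete what I mean by the leveraged approach, here is a rough sketch of the idea as I understand it: run affinity propagation on a random subsample, then assign every remaining point to its nearest exemplar. The function name subsample_affinity_propagation is mine, and this is only an approximation of the leveraged implementation I saw in R, not a port of it:

import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.metrics import pairwise_distances_argmin

def subsample_affinity_propagation(X, sample_size=1000, random_state=0):
    # Run affinity propagation on a random subsample only, so the
    # message-passing step needs O(sample_size^2) memory instead of O(n^2).
    rng = np.random.RandomState(random_state)
    n = X.shape[0]
    idx = rng.choice(n, size=min(sample_size, n), replace=False)
    sample = X[idx]
    if hasattr(sample, 'toarray'):  # densify just the sample if X is sparse
        sample = sample.toarray()
    af = AffinityPropagation(damping=0.5).fit(sample)
    exemplars = af.cluster_centers_
    # Assign every point to its nearest exemplar; this works in chunks
    # and never builds the full distance matrix.
    labels = pairwise_distances_argmin(X, exemplars)
    return labels, exemplars

The trade-off is that exemplars can only come from the subsample, so I am not sure how close this gets to the real leveraged algorithm.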
Here is the affinity propagation code I am trying:
import logging

import pandas as pd
from sklearn.cluster import AffinityPropagation

logger = logging.getLogger(__name__)

def affinity_cluster_technique(self, preference=None, X_train_vecs=None, filenames=None, contents=None):
    """
    Cluster TF-IDF document vectors with affinity propagation.

    :param preference: preference passed to AffinityPropagation; None keeps
        sklearn's default (the median of the input similarities)
    :param X_train_vecs: TF-IDF vectors, one row per document
    :param filenames: filename of each document
    :param contents: raw text of each document
    :return: (dict mapping cluster name to its filenames, X, labels, exemplars)
    """
    logger.info('Into the affinity core engine having the preference {}'.format(preference))
    if X_train_vecs is None:
        logger.error('The X_train data is None')
        return dict(), None, None, None

    # AffinityPropagation wants a dense array; densify if we got a sparse matrix.
    X = X_train_vecs.toarray() if hasattr(X_train_vecs, 'toarray') else X_train_vecs
    # (LSA via TruncatedSVD + Normalizer and cosine distances were also tried
    # here to shrink the matrix; neither solved the memory problem.)
    logger.info('The shape of X_train is {}'.format(X.shape))

    # preference=None falls back to sklearn's default (median of the similarities)
    af = AffinityPropagation(damping=0.5, preference=preference, verbose=True)
    logger.info('The affinity propagation object is {}'.format(af))
    y = af.fit_predict(X)
    exemplars = af.cluster_centers_
    cluster_labels = af.labels_.tolist()
    logger.info('The total number of clusters Affinity generated is: {}'.format(len(exemplars)))

    data = {'filename': filenames, 'contents': contents, 'cluster_label': cluster_labels}
    frame = pd.DataFrame(data=data, columns=['filename', 'contents', 'cluster_label'])
    logger.info('Sample of the clustered df {}'.format(frame.head(2)))
    cluster_and_count_of_docs = frame['cluster_label'].value_counts(sort=True, ascending=False)
    logger.info('Documents per cluster: {}'.format(cluster_and_count_of_docs.to_dict()))

    # Map each cluster label to the list of filenames it contains
    # (groupby replaces the deprecated frame.ix lookup and the try/except
    # that was only there to handle single-file clusters).
    dict_of_cluster_and_filename = {
        'clusters ' + str(label): group['filename'].tolist()
        for label, group in frame.groupby('cluster_label')
    }
    return dict_of_cluster_and_filename, X, y, exemplars
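For context, this is roughly how the method is fed; ClusterEngine is just a stand-in name for the class it lives on:

from sklearn.feature_extraction.text import TfidfVectorizer

contents = ["first document text", "second document text", "third document text"]
filenames = ["a.txt", "b.txt", "c.txt"]
X_train_vecs = TfidfVectorizer(stop_words='english').fit_transform(contents)

engine = ClusterEngine()  # stand-in for my actual class
clusters, X, y, exemplars = engine.affinity_cluster_technique(
    preference=None, X_train_vecs=X_train_vecs,
    filenames=filenames, contents=contents)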
Please tell me if I am doing something wrong, and help me figure out how to make this work on a large dataset.