How to use affinity propagation with sklearn for a large dataset

Time: 2018-03-08 14:32:51

Tags: python machine-learning scikit-learn cluster-analysis tf-idf

I am using affinity propagation to create clusters for my dataset, which is very large. I use tf-idf to convert my documents into vectors and then feed them to affinity propagation. For smaller datasets it works seamlessly, but for large datasets it starts consuming a huge amount of RAM and eventually the OS kills the process. I did a lot of research and found a few StackOverflow posts that suggest using np.median or np.mean (for the preference), but that did not solve my problem. I also applied principal component analysis and tried to reduce the matrix, and it still consumed a lot of RAM. Then I found leveraged affinity propagation, which handles large datasets by computing similarities only for randomly chosen points instead of the full NxN matrix, but I could not find it in sklearn or anywhere else in Python. I have only seen it in R. enter link description here
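
What I understood those posts to mean is roughly the following: precompute the similarity matrix and pass its median as the preference. A minimal sketch of that idea (with a small random matrix standing in for my real tf-idf vectors):

    import numpy as np
    from sklearn.cluster import AffinityPropagation
    from sklearn.metrics.pairwise import cosine_similarity

    # A small random matrix standing in for my real tf-idf vectors.
    X = np.random.RandomState(0).rand(100, 20)

    # Precompute the similarities and use their median as the preference,
    # which is what those posts suggest.
    S = cosine_similarity(X)
    af = AffinityPropagation(affinity='precomputed', preference=np.median(S), damping=0.5)
    labels = af.fit_predict(S)

This still materialises the full NxN similarity matrix, which is exactly what runs out of memory on my real data.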

How can we use leveraged affinity propagation with sklearn in Python?

This is the affinity propagation code I am currently using:

    import logging

    import pandas as pd
    from sklearn.cluster import AffinityPropagation

    logger = logging.getLogger(__name__)

    def affinity_cluster_technique(self, preference=None, X_train_vecs=None, filenames=None, contents=None):
        """
        :param preference: preference value handed to AffinityPropagation (None keeps sklearn's default)
        :param X_train_vecs: tf-idf matrix of the documents
        :param filenames: list of document file names
        :param contents: list of document contents
        :return: dict mapping cluster name to file names, plus X, the labels and the exemplars
        """
        logger.info('Into the affinity core engine having the preference {}'.format(str(preference)))
        if X_train_vecs is not None:
            X = X_train_vecs
            # X = cosine_distances(X)

            # svd = TruncatedSVD(n_components=100)
            # normalizer = Normalizer(copy=False)
            # lsa = make_pipeline(svd, normalizer)
            # X = X_train_vecs = lsa.fit_transform(X_train_vecs)

            # X = StandardScaler().fit_transform(X)
            logger.info("The shape of X_train after the lsa {}".format(X_train_vecs.shape))
            # X = X_train_vecs.toarray()
            # X = np.array(X)
            logger.info('Vector to array of the X data processed')
            if preference is not None:
                af = AffinityPropagation(damping=0.5, preference=preference, verbose=True)
            else:
                af = AffinityPropagation(damping=0.5, preference=None, verbose=True)
            logger.info('The affinity propagation object is {}'.format(str(af)))
            y = af.fit_predict(X)
            exemplars = af.cluster_centers_
            # One cluster label per document, in the same order as filenames/contents.
            cluster_labels = af.labels_.tolist()

            # logger.info("The total number of cluster Affinity generated is: {}".format(str(len(exemplars))))
            data = {'filename': filenames, 'contents': contents, 'cluster_label': cluster_labels}
            frame = pd.DataFrame(data=data, columns=['filename', 'contents', 'cluster_label'])
            logger.info('Sample of the clustered df {}'.format(str(frame.head(2))))
            cluster_and_count_of_docs = frame['cluster_label'].value_counts(sort=True, ascending=False)
            dict_of_cluster_and_filename = dict()

            for i in set(cluster_labels):
                # Collect every filename that falls into cluster i; this always yields a list,
                # even when the cluster holds a single document.
                list_of_files_in_cluster = frame.loc[frame['cluster_label'] == i, 'filename'].tolist()
                dict_of_cluster_and_filename['clusters ' + str(i)] = list_of_files_in_cluster

            return dict_of_cluster_and_filename, X, y, exemplars
        else:
            logger.error("The X_train data is None")
            return dict(), None, None, None
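
For reference, X_train_vecs comes from a plain TfidfVectorizer, roughly like this (a toy two-document corpus just to show the shapes; my real corpus is far larger):

    from sklearn.feature_extraction.text import TfidfVectorizer

    contents = ["text of the first document", "text of the second document"]
    filenames = ["doc_1.txt", "doc_2.txt"]

    vectorizer = TfidfVectorizer(stop_words='english')
    X_train_vecs = vectorizer.fit_transform(contents)  # scipy sparse matrix, shape (n_docs, n_terms)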

Please tell me if I am doing something wrong, and help me figure out how to make this work for a large dataset.
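
Since leveraged affinity propagation itself does not seem to exist in sklearn, I was also wondering whether an approximation along these lines would be acceptable: run affinity propagation only on a random sample of documents and then assign every document to its nearest exemplar. This is just a rough sketch of the idea (the sample_size value and the cosine-based assignment step are my own guesses, not anything taken from sklearn or from the R package):

    import numpy as np
    from sklearn.cluster import AffinityPropagation
    from sklearn.metrics.pairwise import cosine_similarity

    def sampled_affinity_propagation(X, sample_size=2000, random_state=0):
        """Cluster a random sample with affinity propagation, then map every row
        of X to its most similar exemplar (hypothetical helper, not part of sklearn)."""
        rng = np.random.RandomState(random_state)
        n_docs = X.shape[0]
        idx = rng.choice(n_docs, size=min(sample_size, n_docs), replace=False)

        # Only a sample_size x sample_size similarity matrix is built here,
        # instead of the full NxN matrix that exhausts the RAM.
        S = cosine_similarity(X[idx])
        af = AffinityPropagation(affinity='precomputed', preference=np.median(S), damping=0.5)
        af.fit(S)

        # The exemplar rows come from the original tf-idf matrix via the sample indices.
        exemplar_rows = X[idx][af.cluster_centers_indices_]

        # Every document, sampled or not, gets the label of its most similar exemplar.
        labels = cosine_similarity(X, exemplar_rows).argmax(axis=1)
        return labels, idx[af.cluster_centers_indices_]

Would something like this be a reasonable substitute for leveraged affinity propagation, or is there a proper way to do it in Python?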

0 Answers:

There are no answers yet.