Is it necessary to standard-scale the data before using Spectral Biclustering in scikit-learn?

Asked: 2018-10-01 09:20:13

Tags: python scikit-learn cluster-analysis spectral

Hey,

I have a dataset from different cohorts that I want to use with the sklearn function Spectral Biclustering. As you can see in the link above, this method applies a kind of normalization when computing the SVD.

Is it necessary to normalize the data before biclustering, e.g. with StandardScaler (zero mean and a std of 1), given that the function above already applies its own normalization? Is that built-in normalization sufficient, or do I still have to normalize the data myself (e.g. when the data come from different distributions)?
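For reference, StandardScaler standardizes each column independently to zero mean and unit variance. A minimal sketch of that transform (with made-up toy numbers, written in plain NumPy as a stand-in for `StandardScaler().fit_transform`):

```python
import numpy as np

# Toy data standing in for two features on very different scales
# (the numbers are made up, just for illustration).
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

# Column-wise standardization, equivalent to StandardScaler().fit_transform(X):
# subtract each column's mean and divide by its (population) std.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0))  # each column now has mean ~0
print(X_scaled.std(axis=0))   # and std 1
```

Note that this rescales each feature separately, which is a different operation from the row/column bistochastic normalization SpectralBiclustering performs internally, so the two are not interchangeable.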

I get different results depending on whether I apply standard scaling or not, and I could not find any information in the original paper on whether it is necessary.

Below you can find an example with my code and my dataset. This is real data, so I don't know the ground truth. At the end I compute the consensus score to compare the two biclusterings. Unfortunately, the clusters are not identical.

I also tried it with artificial data (see the last linked example); there the results are identical, but not for my real data.
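The artificial-data check can be sketched roughly like this (a hypothetical reconstruction, not the original linked example; the shapes, noise level, and parameters here are assumptions), fitting the model on raw and on standard-scaled synthetic checkerboard data and scoring each against the known ground truth:

```python
from sklearn.cluster import SpectralBiclustering
from sklearn.datasets import make_checkerboard
from sklearn.metrics import consensus_score
from sklearn.preprocessing import StandardScaler

# Synthetic data with a known 4x4 checkerboard structure
# (shape and noise are made-up illustration values).
data, rows, cols = make_checkerboard(
    shape=(100, 30), n_clusters=(4, 4), noise=5, random_state=0
)

scores = {}
for name, X in [("raw", data), ("scaled", StandardScaler().fit_transform(data))]:
    model = SpectralBiclustering(n_clusters=(4, 4), method="bistochastic",
                                 svd_method="randomized", random_state=0)
    model.fit(X)
    # Compare the recovered biclusters against the known ground truth.
    scores[name] = consensus_score(model.biclusters_, (rows, cols))

print(scores)  # each score lies in [0, 1]; 1.0 is a perfect match
```

If both variants recover the planted structure equally well, the two scores coincide; on real data without a ground truth, only the model-vs-model comparison (as in the code below) is available.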

So how do I know which approach is the right one?

import numpy as np
from matplotlib import pyplot as plt
import pandas as pd
import seaborn as sns

from sklearn.cluster import SpectralBiclustering  # was sklearn.cluster.bicluster before sklearn 0.24
from sklearn.metrics import consensus_score
from sklearn.preprocessing import StandardScaler

n_clusters = (4, 4)

data_org = pd.read_csv('raw_data_biclustering.csv', sep=',', index_col=0) 


# scale data & transform to dataframe
data_scaled = StandardScaler().fit_transform(data_org)
data_scaled = pd.DataFrame(data_scaled, columns=data_org.columns, index=data_org.index)


# plot the scaled dataset before biclustering
plt.imshow(data_scaled, aspect='auto', vmin=-3, vmax=5)
plt.title("Scaled dataset")
plt.show()


data_type = ['none_scaled', 'scaled']
data_all = [data_org, data_scaled]

models_all = []

for name, data in zip(data_type,data_all):

    # spectral biclustering on the dataset
    model = SpectralBiclustering(n_clusters=n_clusters, method='bistochastic',
                                 svd_method='randomized', n_jobs=-1,
                                 random_state=0)
    model.fit(data)


    newOrder_row = [list(r) for r in zip(model.row_labels_, data.index)]
    newOrder_row.sort(key=lambda k: (k[0], k[1]), reverse=False)
    order_row = [i[1] for i in newOrder_row]

    newOrder_col = [list(c) for c in zip(model.column_labels_, [int(x) for x in data.keys()])]
    newOrder_col.sort(key=lambda k: (k[0], k[1]), reverse=False)
    order_col = [i[1] for i in newOrder_col]

    # reorder the data matrix
    X_plot = data_scaled.copy()
    X_plot = X_plot.reindex(order_row) # rows
    X_plot = X_plot[[str(x) for x in order_col]] # columns

    # use clustermap without clustering
    cm=sns.clustermap(X_plot, method=None, metric=None, cmap='viridis'
                  ,row_cluster=False, row_colors=None
                  , col_cluster=False, col_colors=None
                  , yticklabels=1, xticklabels=1
                  , standard_scale=None, z_score=None, robust=False
                  , vmin=-3, vmax=5
                  ) 

    ax = cm.ax_heatmap

    # set labelsize smaller
    cm_ax = plt.gcf().axes[-2]
    cm_ax.tick_params(labelsize=5.5)


    # plot lines for the different clusters
    hor_lines = [sum(item) for item in model.biclusters_[0]]
    hor_lines = list(np.cumsum(hor_lines[::n_clusters[1]]))

    ver_lines = [sum(item) for item in model.biclusters_[1]]
    ver_lines = list(np.cumsum(ver_lines[:n_clusters[1]]))

    for pp in range(len(hor_lines)-1):
        cm.ax_heatmap.hlines(hor_lines[pp],0,X_plot.shape[1], colors='r')

    for pp in range(len(ver_lines)-1):
        cm.ax_heatmap.vlines(ver_lines[pp],0,X_plot.shape[0], colors='r')

    # title
    title = name+' - '+str(n_clusters[1])+'-'+str(n_clusters[0])
    plt.title(title)
    cm.savefig(title,dpi=300)
    plt.show() 

    # save models
    models_all.append(model)

# compare models    
score = consensus_score(models_all[0].biclusters_, models_all[1].biclusters_)
print("consensus score between the two models: {:.1f}".format(score))

0 Answers:

No answers