Using seaborn

Date: 2019-02-27 11:44:57

Tags: python pandas seaborn bioinformatics

I have a data frame that contains all pairwise distances between (phage) virus-like genomes.

import pandas as pd

# Load the tab-separated table of pairwise distances
distData = pd.read_csv("distances.tab", sep="\t")
print(distData.head())
           reference-ID              query-ID  distance  p-value shared-hashes
0  H055-6_SuMu_C01_Hinf  H055-6_SuMu_C01_Hinf  0.000000      0.0     1000/1000
1  H049-6_SuMu_C01_Hinf  H055-6_SuMu_C01_Hinf  0.002000      0.0      921/1000
2  H013-2_SuMu_C01_Hinf  H055-6_SuMu_C01_Hinf  0.010914      0.0      660/1000
3  H081-1_SuMu_C01_Hinf  H055-6_SuMu_C01_Hinf  0.040448      0.0      272/1000
4  H058-3_SuMu_C01_Hinf  H055-6_SuMu_C01_Hinf  0.040310      0.0      273/1000

I want to use pandas and seaborn to reproduce a clustermap with colored leaves, using the distances arranged as a square matrix, squareDistMatrix.

# Create the heatmap for distances
squareDistMatrix = pd.pivot_table(distData, values='distance', index=['query-ID'], columns='reference-ID')
squareDistMatrix.head()

reference-ID             A001-2_Mu_C10_Aact  A001-3_Aaphi23_C25_Aact  A005-1_B3_C13_Aact  A010-2_B3_C13_Aact  A011-1_B3_C13_Aact
query-ID
A001-2_Mu_C10_Aact                 0.000000                      1.0            1.000000            0.136948            0.295981
A001-3_Aaphi23_C25_Aact            1.000000                      0.0            1.000000            1.000000            1.000000
A005-1_B3_C13_Aact                 1.000000                      1.0            0.000000            0.052915            0.050764
A010-2_B3_C13_Aact                 0.136948                      1.0            0.052915            0.000000            0.005942
A011-1_B3_C13_Aact                 0.295981                      1.0            0.050764            0.005942            0.000000
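Before clustering, it may be worth checking that the pivot really produced a complete, symmetric square matrix, since pd.pivot_table averages duplicate pairs and leaves NaN for missing ones. The following is a minimal sketch on made-up data; the genome names and distances are hypothetical and only mimic the layout of distances.tab.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data in the same long format as distances.tab
ids = ["g1_Mu_x", "g2_Mu_x", "g3_B3_x"]
pair_distance = {
    ("g1_Mu_x", "g2_Mu_x"): 0.1,
    ("g1_Mu_x", "g3_B3_x"): 0.9,
    ("g2_Mu_x", "g3_B3_x"): 0.8,
}

rows = []
for ref in ids:
    for query in ids:
        if ref == query:
            d = 0.0
        else:
            # Look the pair up in either orientation
            d = pair_distance.get((ref, query), pair_distance.get((query, ref)))
        rows.append({"reference-ID": ref, "query-ID": query, "distance": d})
long_form = pd.DataFrame(rows)

# Same pivot as in the post: one row per query, one column per reference
square = pd.pivot_table(long_form, values="distance",
                        index="query-ID", columns="reference-ID")

# Sanity checks: same labels on both axes, no missing pairs, symmetric values
same_labels = list(square.index) == list(square.columns)
has_gaps = bool(square.isna().any().any())
is_symmetric = bool(np.allclose(square.to_numpy(), square.to_numpy().T))

print(same_labels, has_gaps, is_symmetric)
```

If any of these checks fails on the real data, the clustermap rows and columns would not describe the same set of genomes, so it is probably worth fixing the input table first.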

I followed these links:

Setting col_colors in seaborn clustermap from pandas

https://seaborn.pydata.org/examples/structured_heatmap.html

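However, these links do not cover building a clustermap from a pivot table generated from the original data frame that holds the data. In addition, I want to color the leaves of the distance matrix by a substring of each row name. That is, all virus names with SuMu as the middle part of the name belong to the same supercluster, i.e. the same evolutionary group, so they should be assigned the same color.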
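Because the pivot table already holds pairwise distances, one option (not taken from the linked posts, and only a sketch) is to build the dendrogram from those distances directly with SciPy instead of letting seaborn compute a new metric between rows. The matrix below is hypothetical toy data; squareform and linkage are standard functions in scipy.spatial.distance and scipy.cluster.hierarchy.

```python
import pandas as pd
from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import squareform

# Hypothetical symmetric distance matrix with a zero diagonal
names = ["A_Mu_1", "B_Mu_2", "C_B3_1", "D_B3_2"]
square = pd.DataFrame(
    [[0.00, 0.10, 0.90, 0.80],
     [0.10, 0.00, 0.85, 0.90],
     [0.90, 0.85, 0.00, 0.05],
     [0.80, 0.90, 0.05, 0.00]],
    index=names, columns=names,
)

# Convert the square matrix to the condensed form that linkage expects
condensed = squareform(square.to_numpy(), checks=True)

# Single-linkage hierarchical clustering on the precomputed distances
tree = linkage(condensed, method="single")

# Leaf order of the resulting dendrogram, as genome names
leaf_order = [names[i] for i in leaves_list(tree)]
print(leaf_order)
```

The resulting tree could then be handed to seaborn through the row_linkage and col_linkage parameters of sns.clustermap, so that the dendrogram reflects the original distances. Whether that is preferable to clustering with a metric such as "correlation" depends on the analysis.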

I solved it with the help of another Stack Overflow post:

Column colors in clustermap of Python seaborn give unexpected results

I am just pasting the solution here for other interested users:

import pandas as pd
import seaborn as sns

# Function to get the supercluster name from the full name of the phage
# Input:  phageFullName, e.g. "H055-6_SuMu_C01_Hinf" (string)
# Output: superClusterName, e.g. "SuMu" (string)
def getSuperClusterName(phageFullName):
    return phageFullName.split("_")[1]

# Assumption: the 'superCluster' column is derived from 'query-ID';
# the pasted code used this column without showing how it was created
distData['superCluster'] = distData['query-ID'].apply(getSuperClusterName)

uniqueSuperClusters = distData['superCluster'].unique()
numUniqueSuperClusters = len(uniqueSuperClusters)
print("Unique superclusters: " + str(numUniqueSuperClusters))

# Create distinct colors, one for each supercluster
superClusterPalette = sns.husl_palette(numUniqueSuperClusters, s=.45)

# dict -> uniqueSuperClusterName: uniqueColour
superClusters2ColoursDict = dict(zip(uniqueSuperClusters, superClusterPalette))

# Create a list with the supercluster of each row of the square distance matrix
squareMatrixRows = list(squareDistMatrix.index)
superClusterInSquareMatrix = [getSuperClusterName(row) for row in squareMatrixRows]

# Create a Series of superclusters in the row order of the distance matrix
superClusterSeries = pd.Series(superClusterInSquareMatrix)
# Map each supercluster to a color using the dictionary
superClusterColours = superClusterSeries.map(superClusters2ColoursDict)

# Clustermap with colors in the rows; .values passes the colors by position,
# because this Series has a 0..n-1 index rather than the matrix row labels
sns.clustermap(squareDistMatrix, metric="correlation", method="single", cmap="RdBu_r", standard_scale=1, row_colors=superClusterColours.values, linewidths=.75, figsize=(13, 13))
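The call above passes the colors as a plain array, so they are matched to rows by position. seaborn also accepts a pandas Series for row_colors and then matches colors to the data by index, which avoids the ordering problem described in the linked post. The sketch below only builds such an index-aligned Series; the row names are examples taken from the tables in the question, and the colorsys palette is a stand-in (an assumption to keep the sketch free of plotting dependencies) for sns.husl_palette.

```python
import colorsys
import pandas as pd

def get_super_cluster_name(phage_full_name):
    # "H055-6_SuMu_C01_Hinf" -> "SuMu"
    return phage_full_name.split("_")[1]

# Stand-in for squareDistMatrix.index
row_names = pd.Index(
    ["H055-6_SuMu_C01_Hinf", "A001-2_Mu_C10_Aact",
     "H049-6_SuMu_C01_Hinf", "A005-1_B3_C13_Aact"],
    name="query-ID",
)

# Supercluster of each row, indexed by the row label itself
super_clusters = pd.Series(
    [get_super_cluster_name(name) for name in row_names],
    index=row_names, name="superCluster",
)

# One RGB tuple per supercluster (stand-in palette, evenly spaced hues)
unique_clusters = sorted(set(super_clusters))
palette = [colorsys.hls_to_rgb(i / len(unique_clusters), 0.6, 0.45)
           for i in range(len(unique_clusters))]
colour_of = dict(zip(unique_clusters, palette))

# Index-aligned colors: same labels as the matrix rows
row_colours = super_clusters.map(colour_of)
print(row_colours)
```

With the real matrix, row_colours could be passed as row_colors=row_colours (without .values) to sns.clustermap; the Series name would then also be used as the label of the color bar.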

0 answers:

No answers