I have a DataFrame that contains all pairwise distances between (bacterio)phage virus-like genomes.
import pandas as pd
import seaborn as sns
distData = pd.read_csv("distances.tab", sep="\t")
print(distData.head())
           reference-ID              query-ID  distance  p-value  shared-hashes
0  H055-6_SuMu_C01_Hinf  H055-6_SuMu_C01_Hinf  0.000000      0.0      1000/1000
1  H049-6_SuMu_C01_Hinf  H055-6_SuMu_C01_Hinf  0.002000      0.0       921/1000
2  H013-2_SuMu_C01_Hinf  H055-6_SuMu_C01_Hinf  0.010914      0.0       660/1000
3  H081-1_SuMu_C01_Hinf  H055-6_SuMu_C01_Hinf  0.040448      0.0       272/1000
4  H058-3_SuMu_C01_Hinf  H055-6_SuMu_C01_Hinf  0.040310      0.0       273/1000
I want to use pandas and seaborn to produce a clustermap with colored leaves, using the distances arranged as a square matrix, squareDistMatrix.
#Create the heatmap for distances
squareDistMatrix = pd.pivot_table(distData, values='distance', index=['query-ID'], columns='reference-ID')
squareDistMatrix.head()
reference-ID             A001-2_Mu_C10_Aact  A001-3_Aaphi23_C25_Aact  A005-1_B3_C13_Aact  A010-2_B3_C13_Aact  A011-1_B3_C13_Aact
query-ID
A001-2_Mu_C10_Aact                 0.000000                      1.0            1.000000            0.136948            0.295981
A001-3_Aaphi23_C25_Aact            1.000000                      0.0            1.000000            1.000000            1.000000
A005-1_B3_C13_Aact                 1.000000                      1.0            0.000000            0.052915            0.050764
A010-2_B3_C13_Aact                 0.136948                      1.0            0.052915            0.000000            0.005942
A011-1_B3_C13_Aact                 0.295981                      1.0            0.050764            0.005942            0.000000
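For readers who want to try the pivot step without the original file, here is a minimal sketch on a made-up three-genome table. The IDs and distances are invented for illustration and are not taken from distances.tab. It only shows that pivot_table turns a complete long-format pairwise table into a square, symmetric matrix indexed by query-ID.

```python
import pandas as pd

# Hypothetical long-format distances for three genomes (values are made up)
ids = ["P1_SuMu_C01_Hinf", "P2_SuMu_C01_Hinf", "P3_B3_C13_Aact"]
dist = {
    ("P1_SuMu_C01_Hinf", "P2_SuMu_C01_Hinf"): 0.002,
    ("P1_SuMu_C01_Hinf", "P3_B3_C13_Aact"): 1.0,
    ("P2_SuMu_C01_Hinf", "P3_B3_C13_Aact"): 1.0,
}
rows = []
for ref in ids:
    for query in ids:
        if ref == query:
            d = 0.0
        else:
            # Look the pair up in either order
            d = dist.get((ref, query), dist.get((query, ref)))
        rows.append({"reference-ID": ref, "query-ID": query, "distance": d})
toyDistData = pd.DataFrame(rows)

# Same call as in the question: one row per query, one column per reference
toySquare = pd.pivot_table(toyDistData, values="distance",
                           index=["query-ID"], columns="reference-ID")

# A complete pairwise table yields a square, symmetric matrix
isSquare = toySquare.shape[0] == toySquare.shape[1]
isSymmetric = (toySquare.values == toySquare.values.T).all()
print(toySquare)
```

If some pairs were missing from the input, the corresponding cells would be NaN, which is worth checking before clustering.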
I followed these links:
Setting col_colors in seaborn clustermap from pandas
https://seaborn.pydata.org/examples/structured_heatmap.html
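However, these links do not cover building a clustermap from a pivot table generated from the original DataFrame that holds the data. In addition, I want to color the leaves of the distance matrix by a substring of each row name. That is, all virus names with SuMu in the middle of the name belong to the same supercluster (an evolutionary group), so they should be assigned the same color.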
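The name-based grouping described above can be sketched with pandas string methods. This assumes, as in the sample IDs, that the supercluster is always the second underscore-separated field of the name. The short list of names is copied from the output shown earlier; the variable names are only illustrative.

```python
import pandas as pd

# A few IDs taken from the sample output above
names = pd.Index([
    "H055-6_SuMu_C01_Hinf",
    "H049-6_SuMu_C01_Hinf",
    "A001-2_Mu_C10_Aact",
    "A001-3_Aaphi23_C25_Aact",
    "A005-1_B3_C13_Aact",
])

# Assumption: the supercluster is the second "_"-separated field of the name
superClusterLabels = names.to_series().str.split("_").str[1]

# Number of genomes per supercluster
print(superClusterLabels.value_counts())
```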
I solved it with the help of another Stack Overflow post:
Column colors in clustermap of Python seaborn give unexpected results
I am pasting the solution here for other interested users:
#Function to get the super cluster name from the full name of the phage
#Input: phageFullName "H055-6_SuMu_C01_Hinf" (String)
#Output: superClusterName "SuMu" (String)
def getSuperClusterName(phageFullName):
    return phageFullName.split("_")[1]
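#the superCluster column is not created in the code shown; deriving it from query-ID is an assumption
distData['superCluster'] = distData['query-ID'].apply(getSuperClusterName)
uniqueSuperClusters = distData['superCluster'].unique()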
numUniqueSuperClusters = len(uniqueSuperClusters)
print("Unique superclusters: " + str(numUniqueSuperClusters))
#create distinct colors one for each supercluster
superClusterPalette = sns.husl_palette(numUniqueSuperClusters, s=.45)
#dict -> uniqueSuperClusterName: uniqueColour
superClusters2ColoursDict = dict(zip(uniqueSuperClusters, superClusterPalette))
#create a list of the supercluster of each row of the square distance matrix
superClusters = distData.superCluster
squareMatrixRows = list(squareDistMatrix.index)
superClusterInSquareMatrix = []
for row in squareMatrixRows:
    superClusterInSquareMatrix.append(getSuperClusterName(row))
#create a Series of supercluster as found in the distance matrix
superClusterSeries = pd.Series(superClusterInSquareMatrix)
#map each supercluster to a colour using the dictionary
superClusterColours = superClusterSeries.map(superClusters2ColoursDict)
#clustermap with colors in the rows
sns.clustermap(squareDistMatrix, metric="correlation", method="single", cmap="RdBu_r", standard_scale=1, row_colors=superClusterColours.values, linewidths=.75, figsize=(13, 13))
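A variation on the last step, offered as a sketch and not as part of the original solution: if the colours are kept in a Series that shares the matrix index, recent seaborn versions match them to rows by label, so passing .values is not needed. The toy matrix and its values below are hypothetical stand-ins for squareDistMatrix, and the names rowSuperClusters, colourBySuperCluster and rowColours are invented for this example.

```python
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for squareDistMatrix (values are made up)
ids = ["A001-2_Mu_C10_Aact", "A005-1_B3_C13_Aact", "A010-2_B3_C13_Aact"]
toySquare = pd.DataFrame(
    [[0.0, 1.0, 0.14], [1.0, 0.0, 0.05], [0.14, 0.05, 0.0]],
    index=ids, columns=ids,
)

# Supercluster of each row, taken from the second "_"-separated field
rowSuperClusters = toySquare.index.to_series().str.split("_").str[1]

# One colour per distinct supercluster
uniqueSuperClusters = rowSuperClusters.unique()
palette = sns.husl_palette(len(uniqueSuperClusters), s=.45)
colourBySuperCluster = dict(zip(uniqueSuperClusters, palette))

# Series of colours indexed by genome name, aligned with the matrix rows
rowColours = rowSuperClusters.map(colourBySuperCluster)
```

rowColours can then be passed as row_colors=rowColours in the sns.clustermap call. A Series with a default integer index would not line up with the genome names, which is presumably why the solution above passes .values instead.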