Outputting labels with sklearn.AffinityPropagation

Asked: 2013-04-12 10:10:59

Tags: machine-learning bioinformatics scikit-learn

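I have a data set that is a distance matrix for 1000 homologous protein sequences.

I have managed to compute an affinity matrix from it (a simple calculation: 1 - distance, in my case).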
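
As a rough illustration of the 1 - distance conversion described above, here is a minimal sketch in plain Python. The 3x3 matrix and the helper name `distance_to_affinity` are made up for this example and do not come from the actual data; the sketch also assumes the distances are scaled to the range [0, 1].

```python
def distance_to_affinity(dist):
    """Return a new matrix whose entries are 1 - distance (hypothetical helper)."""
    return [[1.0 - d for d in row] for row in dist]

# Toy symmetric distance matrix (illustrative values only, not real data).
toy_dist = [
    [0.0, 0.25, 0.75],
    [0.25, 0.0, 0.5],
    [0.75, 0.5, 0.0],
]

toy_affinity = distance_to_affinity(toy_dist)
for row in toy_affinity:
    print(row)
```

With this definition, identical sequences (distance 0) receive the highest affinity, 1.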

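Basically, if you look at the data in Excel, there is no header row; column 1 holds the sequence names and the next 1000 columns hold the distance values.

I modified the code provided on the sklearn Affinity Propagation page. This is what it looks like now: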

print(__doc__)

import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn import metrics
import csv

##############################################################################
# Read the CSV: column 0 holds the sequence name, the remaining columns the distances.
sequence_names = []
distance_matrix = []
full_data = []

with open('ha-sequences-sample-distmat2.csv', 'r') as f:
    for row in csv.reader(f):
        sequence_names.append(row[0])
        distance_matrix.append(row[1:])
        full_data.append(row)

# Convert the string fields into a float array of distances.
distmat = np.array(distance_matrix).astype(float)

# Affinity (similarity) is taken to be 1 - distance.
affinity_matrix = 1 - distmat

full_matrix = list(zip(sequence_names, affinity_matrix))

# print affinity_matrix, sequence_names




##############################################################################
# Compute Affinity Propagation
af = AffinityPropagation(affinity='precomputed').fit(affinity_matrix)
cluster_centers_indices = af.cluster_centers_indices_
labels = af.labels_

n_clusters_ = len(cluster_centers_indices)

print('Estimated number of clusters: %d' % n_clusters_)
# NOTE: the scores below compare the clustering against ground-truth class labels;
# sequence_names is passed in that role here.
print("Homogeneity: %0.3f" % metrics.homogeneity_score(sequence_names, labels))
print("Completeness: %0.3f" % metrics.completeness_score(sequence_names, labels))
print("V-measure: %0.3f" % metrics.v_measure_score(sequence_names, labels))
print("Adjusted Rand Index: %0.3f" %
      metrics.adjusted_rand_score(sequence_names, labels))
print("Adjusted Mutual Information: %0.3f" %
      metrics.adjusted_mutual_info_score(sequence_names, labels))
print("Silhouette Coefficient: %0.3f" %
      metrics.silhouette_score(affinity_matrix, labels, metric='sqeuclidean'))

##############################################################################
# Plot result
import pylab as pl
from itertools import cycle

pl.close('all')
pl.figure(1)
pl.clf()

colors = cycle('bgrcmykbgrcmykbgrcmykbgrcmyk')
for k, col in zip(range(n_clusters_), colors):
    class_members = labels == k
    cluster_center = affinity_matrix[cluster_centers_indices[k]]
    pl.plot(affinity_matrix[class_members, 0], affinity_matrix[class_members, 1], col + '.')
    pl.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
            markeredgecolor='k', markersize=14)
    for x in affinity_matrix[class_members]:
        pl.plot([cluster_center[0], x[0]], [cluster_center[1], x[1]], col)

pl.title('Estimated number of clusters: %d' % n_clusters_)
pl.show()

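The problem I am running into: I cannot figure out how to output the sequence names that correspond to each cluster. Ideally I would print the sequences that cluster together to the shell and show the cluster numbers on the plot, but even without anything shown on the plot, that would be fine.

Does anyone know how to do this?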

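1 Answer:

Answer 0 (score: 5)

You have a list of sequence names (sequence_names) and an array of cluster labels (af.labels_). So you can loop over the cluster label array and build a mapping from each cluster label to a list of sequence names. For example: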

# For a simple example, assume the names and cluster labels are predefined.
sequence_names = ["a", "b", "c", "d"]
labels = [0,1,1,0]

from collections import defaultdict
clusternames = defaultdict(list)

for i, label in enumerate(labels):
    clusternames[label].append(sequence_names[i])

# clusternames now maps each cluster label to its list of sequence names.
# Print each label together with its list of names.
for k, v in clusternames.items():
    print("%s %s" % (k, v))
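
Building on the same idea, the sketch below also reports each cluster's exemplar (the sequence that Affinity Propagation picked as the cluster center). It assumes that label k corresponds to the k-th entry of af.cluster_centers_indices_, which is how the question's plotting code already uses it. The function name `summarize_clusters` and the toy inputs are hypothetical and stand in for the real fitted values.

```python
from collections import defaultdict

def summarize_clusters(names, labels, center_indices):
    """Map each cluster label to its exemplar name and its member names."""
    members = defaultdict(list)
    for name, label in zip(names, labels):
        members[int(label)].append(name)
    return {
        label: {"exemplar": names[center_indices[label]], "members": group}
        for label, group in sorted(members.items())
    }

# Toy stand-ins for sequence_names, af.labels_ and af.cluster_centers_indices_.
sequence_names = ["a", "b", "c", "d"]
labels = [0, 1, 1, 0]
cluster_centers_indices = [0, 2]

summary = summarize_clusters(sequence_names, labels, cluster_centers_indices)
for label, info in summary.items():
    print("cluster %d (exemplar %s): %s"
          % (label, info["exemplar"], ", ".join(info["members"])))
```

With the real data, passing af.labels_ and af.cluster_centers_indices_ in place of the toy lists should give one line per cluster in the shell.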