I want to compare two partitions/clusterings (P1 and P2) of a set S, where the partitions have different numbers of clusters. For example:
S = [1, 2, 3, 4, 5, 6]
P1 = [[1, 2], [3,4], [5,6]]
P2 = [ [1,2,3,4], [5, 6]]
From what I have seen, mutual information could be one way to do this, and it is implemented in scikit-learn. Nothing in its definition says the partitions must have the same number of clusters (http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mutual_info_score.html).
However, when I try it in my code, I get an error because of the differing sizes.
from sklearn import metrics
P1 = [[1, 2], [3, 4], [5, 6]]
P2 = [[1, 2, 3, 4], [5, 6]]
metrics.mutual_info_score(P1,P2)
ValueError                                Traceback (most recent call last)
<ipython-input-183-d5cb8d32ce7d> in <module>()
2 P2 = [ [1,2,3,4], [5, 6]]
3
----> 4 metrics.mutual_info_score(P1,P2)
/home/user/anaconda2/lib/python2.7/site-packages/sklearn/metrics/cluster/supervised.pyc in mutual_info_score(labels_true, labels_pred, contingency)
556 """
557 if contingency is None:
--> 558 labels_true, labels_pred = check_clusterings(labels_true, labels_pred)
559 contingency = contingency_matrix(labels_true, labels_pred)
560 contingency = np.array(contingency, dtype='float')
/home/user/anaconda2/lib/python2.7/site-packages/sklearn/metrics/cluster/supervised.pyc in check_clusterings(labels_true, labels_pred)
34 if labels_true.ndim != 1:
35 raise ValueError(
---> 36 "labels_true must be 1D: shape is %r" % (labels_true.shape,))
37 if labels_pred.ndim != 1:
38 raise ValueError(
ValueError: labels_true must be 1D: shape is (3, 2)
Is there a way to use scikit-learn and mutual information to see how close these partitions are? If not, is there a way that does not use mutual information?
Answer 0 (score: 0)
The error is in the form in which the information is passed to the function. The correct form is a list of labels, one per element of the global set being partitioned — here, one label per element of S. Each label identifies the cluster the element belongs to, so elements with the same label are in the same cluster. For this example:
from sklearn import metrics

S = [1, 2, 3, 4, 5, 6]
P1 = [[1, 2], [3, 4], [5, 6]]
P2 = [[1, 2, 3, 4], [5, 6]]
labs_1 = [1, 1, 2, 2, 3, 3]
labs_2 = [1, 1, 1, 1, 2, 2]
metrics.mutual_info_score(labs_1, labs_2)
The answer is:
0.636514168294813
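The label lists above were written by hand. As a sketch, a small helper (the name `partition_to_labels` is my own) can derive them mechanically from the partition format, assuming every element of S appears in exactly one block:

```python
def partition_to_labels(S, partition):
    # Map each element of S to the index of the block that contains it.
    # Assumes each element of S appears in exactly one block.
    lookup = {x: idx for idx, block in enumerate(partition) for x in block}
    return [lookup[x] for x in S]

S = [1, 2, 3, 4, 5, 6]
P1 = [[1, 2], [3, 4], [5, 6]]
P2 = [[1, 2, 3, 4], [5, 6]]
print(partition_to_labels(S, P1))  # [0, 0, 1, 1, 2, 2]
print(partition_to_labels(S, P2))  # [0, 0, 0, 0, 1, 1]
```

mutual_info_score only cares about which elements share a label, not the label values themselves, so these 0-based labels give the same score as the 1-based ones above.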
If we want to compute the mutual information directly on the partition format given originally, the following code can be used:
from __future__ import division
import numpy as np

S = [1, 2, 3, 4, 5, 6]
P1 = [[1, 2], [3, 4], [5, 6]]
P2 = [[1, 2, 3, 4], [5, 6]]
set_partition1 = [set(p) for p in P1]
set_partition2 = [set(p) for p in P2]

def prob_dist(clustering, cluster, N):
    return len(clustering[cluster]) / N

def prob_joint_dist(clustering1, clustering2, cluster1, cluster2, N):
    '''
    N (int): total number of elements
    clustering1 (list): first partition
    clustering2 (list): second partition
    cluster1 (int): index of a cluster in the first partition
    cluster2 (int): index of a cluster in the second partition
    '''
    c1 = clustering1[cluster1]
    c2 = clustering2[cluster2]
    n_ij = len(set(c1).intersection(c2))
    return n_ij / N

def mutual_info(clustering1, clustering2, N):
    '''
    clustering1 (list): first partition
    clustering2 (list): second partition
    Note: under Python 2 this requires "from __future__ import division"
    (which must be the first import) so that / is true division.
    '''
    n_clas = len(clustering1)
    n_com = len(clustering2)
    mutual_info = 0
    for i in range(n_clas):
        for j in range(n_com):
            p_i = prob_dist(clustering1, i, N)
            p_j = prob_dist(clustering2, j, N)
            R_ij = prob_joint_dist(clustering1, clustering2, i, j, N)
            if R_ij > 0:
                mutual_info += R_ij * np.log(R_ij / (p_i * p_j))
    return mutual_info

mutual_info(set_partition1, set_partition2, len(S))
which gives the same answer:
0.63651416829481278
Note that we are using the natural logarithm rather than log2, so the result is in nats. The code can easily be adapted.
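For a result in bits, replace np.log with np.log2. The sketch below (the name `mutual_info_bits` is my own) folds the helpers above into a single function and makes that swap:

```python
import numpy as np

def mutual_info_bits(clustering1, clustering2, N):
    # Same double loop over cluster pairs as above, but with np.log2,
    # so the result is measured in bits rather than nats.
    mi = 0.0
    for c1 in clustering1:
        for c2 in clustering2:
            p_i = len(c1) / N                   # P(element in c1)
            p_j = len(c2) / N                   # P(element in c2)
            p_ij = len(set(c1) & set(c2)) / N   # joint probability
            if p_ij > 0:
                mi += p_ij * np.log2(p_ij / (p_i * p_j))
    return mi

P1 = [[1, 2], [3, 4], [5, 6]]
P2 = [[1, 2, 3, 4], [5, 6]]
print(mutual_info_bits(P1, P2, 6))
```

The value in bits is the nats value divided by ln(2), i.e. about 0.6365 / 0.6931 ≈ 0.918.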