Python scikit-learn implementation of mutual information not working for partitions of different sizes

Time: 2017-08-01 17:25:26

Tags: python scikit-learn

I want to compare different partitions/clusterings (P1 and P2) of a set S, where the groups have different sizes. For example:

S = [1, 2, 3, 4, 5, 6]
P1 = [[1, 2], [3,4], [5,6]]
P2 = [ [1,2,3,4], [5, 6]]

From what I have seen, mutual information may be a suitable method, and it is implemented in scikit-learn. Nothing in its definition requires the partitions to be the same size (http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mutual_info_score.html).

However, when I try it in my code, I get an error caused by the differing sizes.

from sklearn import metrics
P1 = [[1, 2], [3,4], [5,6]]
P2 = [ [1,2,3,4], [5, 6]]
metrics.mutual_info_score(P1,P2)


ValueErrorTraceback (most recent call last)
<ipython-input-183-d5cb8d32ce7d> in <module>()
      2 P2 = [ [1,2,3,4], [5, 6]]
      3 
----> 4 metrics.mutual_info_score(P1,P2)

/home/user/anaconda2/lib/python2.7/site-packages/sklearn/metrics/cluster/supervised.pyc in mutual_info_score(labels_true, labels_pred, contingency)
    556     """
    557     if contingency is None:
--> 558         labels_true, labels_pred = check_clusterings(labels_true, labels_pred)
    559         contingency = contingency_matrix(labels_true, labels_pred)
    560     contingency = np.array(contingency, dtype='float')

/home/user/anaconda2/lib/python2.7/site-packages/sklearn/metrics/cluster/supervised.pyc in check_clusterings(labels_true, labels_pred)
     34     if labels_true.ndim != 1:
     35         raise ValueError(
---> 36             "labels_true must be 1D: shape is %r" % (labels_true.shape,))
     37     if labels_pred.ndim != 1:
     38         raise ValueError(

ValueError: labels_true must be 1D: shape is (3, 2)

Is there a way to use scikit-learn and mutual information to measure how close these partitions are? If not, is there a way that does not use mutual information?

1 Answer:

Answer 0: (score: 0)

The error is in the form in which the information is passed to the function. The correct form is to provide a list of labels, one for each element of the global set being partitioned; in this case, one label for each element of S. Each label should correspond to the cluster the element belongs to, so elements with the same label are in the same cluster. To solve the example:

S = [1, 2, 3, 4, 5, 6]
P1 = [[1, 2], [3,4], [5,6]]
P2 = [ [1,2,3,4], [5, 6]]
labs_1 = [ 1, 1, 2, 2, 3, 3]
labs_2 = [1, 1, 1, 1, 2, 2]
metrics.mutual_info_score(labs_1, labs_2)

The answer is:

0.636514168294813
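Building the label lists by hand does not scale to larger sets. A small helper (a sketch; the name `partition_to_labels` is not part of the original answer) can derive them from the partitions directly:

```python
def partition_to_labels(S, partition):
    # Assign to each element of S the index of the cluster that contains it.
    labels = []
    for element in S:
        for idx, cluster in enumerate(partition):
            if element in cluster:
                labels.append(idx)
                break
    return labels

S = [1, 2, 3, 4, 5, 6]
print(partition_to_labels(S, [[1, 2], [3, 4], [5, 6]]))  # [0, 0, 1, 1, 2, 2]
print(partition_to_labels(S, [[1, 2, 3, 4], [5, 6]]))    # [0, 0, 0, 0, 1, 1]
```

The resulting lists can then be passed to `metrics.mutual_info_score` exactly as above; the label values themselves do not matter, only which elements share a label.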

If we want to compute the mutual information directly from the partition format given originally, the following code can be used:

from __future__ import division
from sklearn import metrics
import numpy as np

S = [1, 2, 3, 4, 5, 6]
P1 = [[1, 2], [3,4], [5,6]]
P2 = [ [1,2,3,4], [5, 6]]
set_partition1 = [set(p) for p in P1]
set_partition2 = [set(p) for p in P2]

def prob_dist(clustering, cluster, N):
    return len(clustering[cluster])/N

def prob_joint_dist(clustering1, clustering2, cluster1, cluster2, N):
    '''
    N(int): total number of elements.
    clustering1(list): first partition
    clustering2(list): second partition
    cluster1(int): index of cluster of the first partition
    cluster2(int): index of cluster of second partition
    '''
    c1 = clustering1[cluster1]
    c2 = clustering2[cluster2]
    n_ij = len(set(c1).intersection(c2))
    return n_ij/N

def mutual_info(clustering1, clustering2, N):
    '''
    clustering1(list): first partition
    clustering2(list): second partition
    Note for it to work division from  __future__ must be imported
    '''
    n_clas = len(clustering1)
    n_com = len(clustering2)
    mutual_info = 0
    for i in range(n_clas):
        for j in range(n_com):
            p_i = prob_dist(clustering1, i, N)
            p_j = prob_dist(clustering2, j, N)
            R_ij = prob_joint_dist(clustering1, clustering2, i, j, N)
            if R_ij > 0:
                mutual_info += R_ij*np.log( R_ij / (p_i * p_j))
    return mutual_info

mutual_info(set_partition1, set_partition2, len(S))

which gives the same answer:

0.63651416829481278
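As a quick sanity check (a sketch, assuming scikit-learn is installed), the same double loop can be written inline over the partitions and compared against `metrics.mutual_info_score` on the label lists:

```python
import numpy as np
from sklearn import metrics

P1 = [[1, 2], [3, 4], [5, 6]]
P2 = [[1, 2, 3, 4], [5, 6]]
N = 6.0

# Same computation as mutual_info above, written inline.
mi = 0.0
for c1 in P1:
    for c2 in P2:
        n_ij = len(set(c1) & set(c2))  # size of the cluster intersection
        if n_ij > 0:
            mi += (n_ij / N) * np.log((n_ij / N) / ((len(c1) / N) * (len(c2) / N)))

print(mi)  # ≈ 0.6365
print(metrics.mutual_info_score([1, 1, 2, 2, 3, 3], [1, 1, 1, 1, 2, 2]))  # same value
```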

Note that we are using the natural logarithm rather than log2; the code can easily be adapted.
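For example, a base-2 variant that reports the mutual information in bits (a sketch; `mutual_info_bits` is not part of the original answer) only needs `np.log2` in place of `np.log`:

```python
import numpy as np

def mutual_info_bits(clustering1, clustering2, N):
    # Same double loop as mutual_info above, but with log base 2 (bits).
    mi = 0.0
    for i in range(len(clustering1)):
        for j in range(len(clustering2)):
            p_i = len(clustering1[i]) / N
            p_j = len(clustering2[j]) / N
            r_ij = len(set(clustering1[i]).intersection(clustering2[j])) / N
            if r_ij > 0:
                mi += r_ij * np.log2(r_ij / (p_i * p_j))
    return mi

P1 = [[1, 2], [3, 4], [5, 6]]
P2 = [[1, 2, 3, 4], [5, 6]]
print(mutual_info_bits(P1, P2, 6))  # ≈ 0.9183, i.e. 0.6365 / ln 2
```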