Question

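I am currently using scikit-learn for text classification on the 20ng dataset. I want to calculate the information gain for a vectorized dataset. It was suggested to me that this can be done with mutual_info_classif from sklearn. However, that method is really slow, so I tried to implement information gain myself, based on another post.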

I came up with the following solution:

LogLoss <- function (data, lev = NULL, model = NULL) 
  { 
    obs <- data[, "obs"]
    cls <- levels(obs) #find class names
    probs <- data[, cls[2]] #use second class name
    probs <- pmax(pmin(as.numeric(probs), 1 - 1e-15), 1e-15) #bound probability
    logPreds <- log(probs)        
    log1Preds <- log(1 - probs)
    real <- (as.numeric(data$obs) - 1)
    out <- c(mean(real * logPreds + (1 - real) * log1Preds)) * -1
    names(out) <- c("LogLoss")
    out
  }

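On a very small dataset, most scores from sklearn and from my implementation are equal. However, sklearn seems to take frequencies into account, which my algorithm clearly does not. For example: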

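from scipy.stats import entropy
import numpy as np

def information_gain(X, y):
    """Information gain (in nats) of each column of the sparse matrix X.

    Each feature is reduced to present/absent; the stored counts are ignored.
    """

    def _entropy(labels):
        # entropy() normalizes the class counts to probabilities itself
        counts = np.bincount(labels)
        return entropy(counts, base=None)

    def _ig(x, y):
        # x is one 1 x n_samples sparse row of X.T
        # indices where x is set/not set
        x_set = np.nonzero(x)[1]
        x_not_set = np.delete(np.arange(x.shape[1]), x_set)

        h_x_set = _entropy(y[x_set])
        h_x_not_set = _entropy(y[x_not_set])

        # H(y) minus the weighted entropy of the two partitions
        return entropy_full - (((len(x_set) / f_size) * h_x_set)
                             + ((len(x_not_set) / f_size) * h_x_not_set))

    entropy_full = _entropy(y)

    f_size = float(X.shape[0])

    scores = np.array([_ig(x, y) for x in X.T])
    return scores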

Example script (prints the scores from both methods):

from time import time

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import mutual_info_classif

categories = ['talk.religion.misc', 'comp.graphics', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train',
                                      categories=categories)

X, y = newsgroups_train.data, newsgroups_train.target
cv = CountVectorizer(max_df=0.95, min_df=2,
                     max_features=100,
                     stop_words='english')
X_vec = cv.fit_transform(X)

t0 = time()
res_sk = mutual_info_classif(X_vec, y, discrete_features=True)
print("Time passed for sklearn method: %.3f" % (time()-t0))
t0 = time()
res_ig = information_gain(X_vec, y)
print("Time passed for ig: %.3f" % (time()-t0))

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
for name, mi, ig in zip(cv.get_feature_names_out(), res_sk, res_ig):
    print("%s: mi=%f, ig=%f" % (name, mi, ig))

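So I was wondering whether my implementation is wrong, or whether it is correct and scikit-learn simply uses a different variant of the mutual information algorithm.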
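One way to probe the "frequency" observation is to compare two versions of the same score on a toy column. The sketch below uses only NumPy and rests on an assumption that is not confirmed anywhere in this thread: that a discrete mutual information estimate treats every distinct count value (0, 1, 2, ...) as its own category, whereas information_gain above only distinguishes present from absent. The helper names `entropy_nats` and `mutual_information` and the toy arrays are made up for illustration.

```python
import numpy as np

def entropy_nats(counts):
    """Shannon entropy (natural log) of a vector of non-negative counts."""
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()
    return float(-np.sum(p * np.log(p)))

def mutual_information(x, y):
    """I(x; y) in nats, treating every distinct value of x as a category."""
    x_codes = np.unique(x, return_inverse=True)[1]
    y_codes = np.unique(y, return_inverse=True)[1]
    # Joint count table: rows are values of x, columns are classes of y.
    table = np.zeros((x_codes.max() + 1, y_codes.max() + 1))
    np.add.at(table, (x_codes, y_codes), 1)
    # I(x; y) = H(x) + H(y) - H(x, y)
    return (entropy_nats(table.sum(axis=1)) + entropy_nats(table.sum(axis=0))
            - entropy_nats(table.ravel()))

# Hypothetical toy data: one term-count column and three class labels.
x = np.array([0, 0, 1, 1, 2, 2])
y = np.array([0, 0, 1, 1, 2, 2])

mi_counts = mutual_information(x, y)      # counts 0, 1, 2 kept distinct
mi_binary = mutual_information(x > 0, y)  # present/absent only
print("count-aware: %f, binarized: %f" % (mi_counts, mi_binary))
```

On this toy column the count-aware score is larger than the binarized one, because the count alone identifies the class. If the assumption holds for sklearn, binarizing X_vec before calling mutual_info_classif would be a quick way to check whether the two methods then agree.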

Answer 1

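My answer is a bit late, but you should take a look at Orange's implementation. Within their app it is used as a behind-the-scenes processor to help inform the dynamic model parameter building process.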

The implementation itself looks fairly straightforward and could most likely be ported out. The entropy calculation comes first.

The section starting at https://github.com/biolab/orange3/blob/master/Orange/preprocess/score.py#L233

def _entropy(dist):
    """Entropy of class-distribution matrix"""
    p = dist / np.sum(dist, axis=0)
    pc = np.clip(p, 1e-15, 1)
    return np.sum(np.sum(- p * np.log2(pc), axis=0) * np.sum(dist, axis=0) / np.sum(dist))
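To see what this helper returns, here is a small sketch that copies the function body quoted above and applies it to two hand-made inputs. The arrays are hypothetical. The layout (one row per class, one column per feature value) is an assumption inferred from how the function sums over axis 0.

```python
import numpy as np

def _entropy(dist):
    """Entropy of class-distribution matrix (body copied from the snippet above)."""
    p = dist / np.sum(dist, axis=0)
    pc = np.clip(p, 1e-15, 1)  # avoid log2(0)
    return np.sum(np.sum(- p * np.log2(pc), axis=0) * np.sum(dist, axis=0) / np.sum(dist))

# 1-D input: plain entropy of a class distribution, in bits.
class_counts = np.array([2, 2])
h_class = _entropy(class_counts)

# 2-D input: each column is the class distribution for one feature value.
# The result is the column entropies averaged with weights equal to the
# share of samples in each column, i.e. a conditional entropy.
cont = np.array([[1, 2],
                 [1, 0]])
h_cond = _entropy(cont)

print(h_class, h_cond)
```

The first column of `cont` is split evenly between the classes and the second is pure, so the weighted average is half of the first column's entropy.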

Then the second part: https://github.com/biolab/orange3/blob/master/Orange/preprocess/score.py#L305

class GainRatio(ClassificationScorer):
    """
    Information gain ratio is the ratio between information gain and
    the entropy of the feature's
    value distribution. The score was introduced in [Quinlan1986]_
    to alleviate overestimation for multi-valued features. See `Wikipedia entry on gain ratio
    <http://en.wikipedia.org/wiki/Information_gain_ratio>`_.
    .. [Quinlan1986] J R Quinlan: Induction of Decision Trees, Machine Learning, 1986.
    """
    def from_contingency(self, cont, nan_adjustment):
        h_class = _entropy(np.sum(cont, axis=1))
        h_residual = _entropy(np.compress(np.sum(cont, axis=0), cont, axis=1))
        h_attribute = _entropy(np.sum(cont, axis=0))
        if h_attribute == 0:
            h_attribute = 1
        return nan_adjustment * (h_class - h_residual) / h_attribute

The actual scoring process happens at https://github.com/biolab/orange3/blob/master/Orange/preprocess/score.py#L218
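For trying the idea without installing Orange, the two quoted pieces can be combined into a standalone sketch. This is not Orange's API: `contingency` and `info_gain_and_ratio` are hypothetical helpers, the toy arrays are made up, and the table layout (classes as rows, feature values as columns) is inferred from the quoted code. The `nan_adjustment` factor is left out because its definition is not shown here.

```python
import numpy as np

def _entropy(dist):
    """Entropy of a class-distribution vector or matrix (same body as quoted)."""
    p = dist / np.sum(dist, axis=0)
    pc = np.clip(p, 1e-15, 1)
    return np.sum(np.sum(- p * np.log2(pc), axis=0) * np.sum(dist, axis=0) / np.sum(dist))

def contingency(x, y):
    """Count table with one row per class in y and one column per value of x."""
    x_codes = np.unique(x, return_inverse=True)[1]
    y_codes = np.unique(y, return_inverse=True)[1]
    table = np.zeros((y_codes.max() + 1, x_codes.max() + 1))
    np.add.at(table, (y_codes, x_codes), 1)
    return table

def info_gain_and_ratio(cont):
    """Information gain and gain ratio, in bits, from a contingency table."""
    h_class = _entropy(np.sum(cont, axis=1))      # H(class)
    # Every column built by contingency() is non-empty, so the np.compress
    # step from the quoted code is not needed in this sketch.
    h_residual = _entropy(cont)                   # H(class | feature)
    h_attribute = _entropy(np.sum(cont, axis=0))  # H(feature)
    gain = h_class - h_residual
    return gain, gain / (h_attribute if h_attribute > 0 else 1)

x = np.array([0, 0, 1, 1])  # hypothetical binary feature
y = np.array([0, 0, 0, 1])  # hypothetical class labels
gain, ratio = info_gain_and_ratio(contingency(x, y))
print("gain=%f, ratio=%f" % (gain, ratio))
```

Because the feature here is split evenly, its own entropy is 1 bit and the gain ratio equals the gain. Note that these scores are in bits (log2), while the information_gain function in the question works in nats.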

Python information gain implementation
