scikit-learn: imputing values in a feature within groups of nominal values in another feature

Time: 2017-03-10 17:12:14

Tags: machine-learning scikit-learn classification mean imputation

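I want to impute the mean of a feature, but compute that mean only from the other examples that share the same category/nominal value in another column. Is this possible with scikit-learn's Imputer class? That would make it easier to add this step to a pipeline.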

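For example:

Using the Titanic dataset from Kaggle: source

How would I impute the mean fare by pclass? The thinking behind this is that ticket costs differ a lot between people in different passenger classes.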
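
To make the goal concrete, here is a tiny sketch of the result I am after. The numbers are made up for illustration (they are not real Titanic rows), and the array layout and variable names are assumptions, not part of the dataset:

```python
import numpy as np

# Made-up rows for illustration only: column 0 = pclass, column 1 = fare
X = np.array([
    [1, 80.0],
    [1, 100.0],
    [1, np.nan],   # should become 90.0, the mean fare of pclass 1
    [3, 8.0],
    [3, 10.0],
    [3, np.nan],   # should become 9.0, the mean fare of pclass 3
])

pclass, fare = X[:, 0], X[:, 1]
imputed = fare.copy()
for value in np.unique(pclass):
    in_class = pclass == value
    # np.nanmean ignores the missing entries when averaging
    imputed[in_class & np.isnan(fare)] = np.nanmean(fare[in_class])

# A plain column-wise mean would instead fill both gaps with the same value
overall_mean = np.nanmean(fare)
```

With a plain column-wise mean both gaps would receive `overall_mean`, which sits between the two classes and is typical of neither.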

Update: after discussing this with a few people, the phrase I should have used is imputing the mean "within-class".

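I have looked at Vivek's comment below and will build a generic pipeline function for what I want when I have time :) I have a good idea of how to do it and will post it as an answer once it is finished.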

1 answer:

Answer 0: (score: 0)

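Below is a very simple approach to my problem, which only handles the mean. A more robust implementation would probably build on scikit-learn's Imputer class, which would mean it could also do the mode, median, etc., and it would handle sparse/dense matrices better.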

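It is based on Vivek Kumar's comment on the original question, which suggested splitting the data into stacks, imputing each stack separately, and then reassembling them.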

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class WithinClassMeanImputer(BaseEstimator, TransformerMixin):
    """Replace missing values in one column with the mean of that column,
    computed separately within each class."""

    def __init__(self, replace_col_index, class_col_index=None, missing_values=np.nan):
        self.missing_values = missing_values
        self.replace_col_index = replace_col_index
        self.y = None
        self.class_col_index = class_col_index

    def fit(self, X, y=None):
        # Remember y so that transform can group by the dependent variable
        self.y = y
        return self

    def _is_missing(self, column):
        # NaN never compares equal to itself, so `column == np.nan` is always False
        if isinstance(self.missing_values, float) and np.isnan(self.missing_values):
            return np.isnan(column)
        return column == self.missing_values

    def transform(self, X):
        X = np.array(X, dtype=float)  # work on a float copy of the input

        if self.class_col_index is None:
            # If we're using the dependent variable
            if self.y is None or len(self.y) != len(X):
                return X  # nothing to group by, leave the data unchanged
            labels = np.asarray(self.y)
        else:
            # If we're using nominal values within a binarised feature (i.e. the
            # classes are unique values within a nominal column - e.g. sex)
            labels = X[:, self.class_col_index]

        column = X[:, self.replace_col_index]  # a view, so writes land in X
        missing = self._is_missing(column)

        # Fill in place rather than concatenating per-class stacks, so the row
        # order (and the alignment with y) is preserved
        for aclass in np.unique(labels):
            in_class = labels == aclass
            # Calculate mean from examples without missing values
            observed = column[in_class & ~missing]
            if observed.size == 0:
                continue  # no observed values in this class to average
            # Broadcast mean to all missing values of this class
            column[in_class & missing] = observed.mean()

        return X
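
The more robust variant mentioned above could look roughly like the following. This is only a sketch under stated assumptions, not a finished implementation: `GroupStatImputer`, its `strategy` parameter, and the `statistics_` / `fallback_` attributes are hypothetical names chosen for illustration, the input is assumed to be a dense 2-D float array with NaN marking missing values, and the grouping column is assumed to be part of `X`. Unlike the class above, it learns the per-class statistic in `fit`, so that rows seen later (for example a test set) are filled with the training statistics. Note also that in current scikit-learn releases the `Imputer` class has been replaced by `sklearn.impute.SimpleImputer`.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class GroupStatImputer(BaseEstimator, TransformerMixin):
    """Hypothetical sketch: impute one column with a per-group statistic
    learned in fit(). Assumes a dense float array with NaN as missing."""

    def __init__(self, replace_col_index, class_col_index, strategy="mean"):
        self.replace_col_index = replace_col_index
        self.class_col_index = class_col_index
        self.strategy = strategy

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # NaN-aware statistics, so missing entries are ignored
        stat = {"mean": np.nanmean, "median": np.nanmedian}[self.strategy]
        column = X[:, self.replace_col_index]
        groups = X[:, self.class_col_index]
        # Fallback for groups that are unseen or entirely missing during fit
        self.fallback_ = stat(column)
        self.statistics_ = {}
        for group in np.unique(groups[~np.isnan(groups)]):
            values = column[groups == group]
            if not np.all(np.isnan(values)):
                self.statistics_[group] = stat(values)
        return self

    def transform(self, X):
        X = np.array(X, dtype=float)  # copy so the caller's data is untouched
        column = X[:, self.replace_col_index]  # view into the copy
        groups = X[:, self.class_col_index]
        # Fill each missing entry with the statistic learned for its group
        for i in np.flatnonzero(np.isnan(column)):
            column[i] = self.statistics_.get(groups[i], self.fallback_)
        return X


# Made-up rows for illustration: column 0 = pclass, column 1 = fare
X_train = np.array([[1, 80.0], [1, 100.0], [1, np.nan],
                    [3, 8.0], [3, 10.0], [3, np.nan]])
X_test = np.array([[1, np.nan], [3, np.nan], [2, np.nan]])

imputer = GroupStatImputer(replace_col_index=1, class_col_index=0).fit(X_train)
X_test_imputed = imputer.transform(X_test)
```

Here the pclass 2 row in `X_test` belongs to a group that never appeared during `fit`, so it falls back to the overall training statistic. Whether that fallback is appropriate depends on the data; raising an error would be an equally reasonable choice.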