Question

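Suppose I want to compare different dimensionality reduction approaches for a particular (supervised) dataset with n > 2 features, using cross-validation and the Pipeline class.

For example, if I want to experiment with PCA and LDA, I could do something like: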

from sklearn.model_selection import cross_val_score, KFold
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.decomposition import PCA

clf_all = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('classification', GaussianNB())
    ])

clf_pca = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('reduce_dim', PCA(n_components=2)),
    ('classification', GaussianNB())
    ])

clf_lda = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('reduce_dim', LDA(n_components=2)),
    ('classification', GaussianNB())
    ])

# Constructing the k-fold cross validation iterator (k=10)

cv = KFold(n_splits=10,      # number of folds the dataset is divided into
           shuffle=True,
           random_state=123)

scores = [
    cross_val_score(clf, X_train, y_train, cv=cv, scoring='accuracy')
            for clf in [clf_all, clf_pca, clf_lda]
    ]
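For readers who want to run the comparison end to end, here is a minimal sketch. It assumes a recent scikit-learn release (where KFold and cross_val_score live in sklearn.model_selection) and uses the bundled wine dataset as stand-in data; the helper make_clf and the dictionary names are hypothetical and not part of the question.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): 13 features, 3 classes.
X_train, y_train = load_wine(return_X_y=True)


def make_clf(reducer=None):
    """Build scaler -> optional dimensionality reduction -> GaussianNB."""
    steps = [("scaler", StandardScaler())]
    if reducer is not None:
        steps.append(("reduce_dim", reducer))
    steps.append(("classification", GaussianNB()))
    return Pipeline(steps=steps)


candidates = {
    "all": make_clf(),
    "pca": make_clf(PCA(n_components=2)),
    "lda": make_clf(LinearDiscriminantAnalysis(n_components=2)),
}

cv = KFold(n_splits=10, shuffle=True, random_state=123)

# Mean cross-validated accuracy per pipeline.
mean_scores = {
    name: cross_val_score(clf, X_train, y_train, cv=cv, scoring="accuracy").mean()
    for name, clf in candidates.items()
}
for name, score in mean_scores.items():
    print(f"{name}: {score:.3f}")
```

Because every step sits inside the pipeline, the scaler and the reducer are refit on each training fold, so no information from the held-out fold leaks into the preprocessing.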

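But now, let's say that, based on some "domain knowledge", I have the hypothesis that features 3 & 4 might be "good features" (the third and fourth columns of the array X_train), and I want to compare them with the other approaches.

How would I include such a manual feature selection in the pipeline?

For example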

def select_3_and_4(X_train):
    return X_train[:,2:4]

clf_all = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('feature_select', select_3_and_4),           
    ('classification', GaussianNB())   
    ])

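obviously wouldn't work.

So I assume I have to create a feature selection class that has a dummy fit method and a transform method that returns the two columns of the numpy array? Or is there a better way?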

Answer 1

I just want to post my solution for completeness; maybe it is useful to someone:

class ColumnExtractor(object):

    def transform(self, X):
        cols = X[:,2:4] # column 3 and 4 are "extracted"
        return cols

    def fit(self, X, y=None):
        return self

Then, it can be used in the Pipeline like so:

clf = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('reduce_dim', ColumnExtractor()),           
    ('classification', GaussianNB())   
    ])

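EDIT: General solution

For a more general solution, if you want to select and stack multiple columns, you can basically use the following class: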

import numpy as np

class ColumnExtractor(object):

    def __init__(self, cols):
        self.cols = cols

    def transform(self, X):
        col_list = []
        for c in self.cols:
            col_list.append(X[:, c:c+1])
        return np.concatenate(col_list, axis=1)

    def fit(self, X, y=None):
        return self

clf = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('dim_red', ColumnExtractor(cols=(1,3))),   # selects the second and 4th column      
    ('classification', GaussianNB())   
    ])

Answer 2

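Adding to Sebastian Raschka's and eickenberg's answers: the requirements a transformer object should satisfy are specified in the scikit-learn documentation.

If you want the estimator to be usable in parameter estimation, there are several more requirements than just having fit and transform, such as implementing set_params.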
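One way to meet those extra requirements without writing get_params and set_params by hand is to inherit from BaseEstimator and TransformerMixin. The sketch below is an illustration under assumptions, not the code from the answers above: the default cols value, the iris stand-in data, and the parameter grid are all made up for the example.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class ColumnExtractor(BaseEstimator, TransformerMixin):
    """Select a fixed set of columns from a 2-D array."""

    def __init__(self, cols=(0, 1)):
        # BaseEstimator derives get_params/set_params from the __init__
        # signature, so the argument is stored unchanged under the same name.
        self.cols = cols

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return np.asarray(X)[:, list(self.cols)]


X, y = load_iris(return_X_y=True)  # stand-in data (assumption)

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("select", ColumnExtractor()),
    ("classification", GaussianNB()),
])

# The column subset becomes a tunable parameter: <step name>__<parameter>.
search = GridSearchCV(
    pipe,
    param_grid={"select__cols": [(0, 1), (2, 3)]},
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

With this in place, a manually chosen column subset can be compared against alternatives inside the same cross-validation loop instead of being hard-coded.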

Answer 3

If you want to use the Pipeline object, then yes, the clean way is to write a transformer object. The dirty way to do this is

select_3_and_4.transform = select_3_and_4.__call__
# Pipeline calls fit(X, y), so the stand-in must accept both arguments
select_3_and_4.fit = lambda X, y=None: select_3_and_4

and use select_3_and_4 in your pipeline. You can evidently also write a class.
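A less hacky way to reuse the plain function, assuming a scikit-learn version that ships sklearn.preprocessing.FunctionTransformer, is to wrap it rather than patch attributes onto it. This is a sketch and not part of the original answer; the iris data is only a stand-in.

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler


def select_3_and_4(X):
    # third and fourth columns (0-based indices 2 and 3)
    return X[:, 2:4]


X_train, y_train = load_iris(return_X_y=True)  # stand-in data (assumption)

clf = Pipeline(steps=[
    ("scaler", StandardScaler()),
    # FunctionTransformer supplies fit/transform around a stateless function
    ("feature_select", FunctionTransformer(select_3_and_4)),
    ("classification", GaussianNB()),
])
clf.fit(X_train, y_train)

X_selected = clf.named_steps["feature_select"].transform(X_train)
print(X_selected.shape)
```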

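Otherwise, you could also just give X_train[:, 2:4] to your pipeline if you know that the other features are irrelevant.

Data-driven feature selection tools may be off-topic here, but they are always useful: check e.g. sklearn.feature_selection.SelectKBest using sklearn.feature_selection.f_classif or sklearn.feature_selection.f_regression, with e.g. k=2 in your case.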
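As a rough sketch of that data-driven alternative, the pipeline below lets SelectKBest pick the two features with the highest ANOVA F-score on each training fold. The iris data and the choice of cv=10 are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_train, y_train = load_iris(return_X_y=True)  # stand-in data (assumption)

clf_kbest = Pipeline(steps=[
    ("scaler", StandardScaler()),
    # keep the k=2 features that score best under the f_classif test
    ("feature_select", SelectKBest(score_func=f_classif, k=2)),
    ("classification", GaussianNB()),
])

scores = cross_val_score(clf_kbest, X_train, y_train, cv=10, scoring="accuracy")
print(scores.mean())

# Fitting on the full data shows which columns were kept.
clf_kbest.fit(X_train, y_train)
kept = clf_kbest.named_steps["feature_select"].get_support(indices=True)
print(kept)
```

Unlike a hard-coded column slice, the selected columns may differ from fold to fold, since the selection is refit on each training split.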

Answer 4

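I did not find a clear answer, so here is my solution for others. Basically, the idea is to create a new class based on BaseEstimator and TransformerMixin.

The following is a feature selector based on the proportion of NA values in a column. The perc value is the NA threshold, expressed as a fraction (0.1 means 10%).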

from sklearn.base import TransformerMixin, BaseEstimator

class NonNAselector(BaseEstimator, TransformerMixin):

    """Select the columns with less than a given fraction of NA values,
    so that they can be imputed later in the pipeline.

    fit : identifies those columns on the training set
    transform : keeps only those columns
    """

    def __init__(self, perc=0.1):
        self.perc = perc
        self.columns_with_less_than_x_na_id = None

    def fit(self, X, y=None):
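        na_fraction = X.isna().sum() / X.shape[0]
        # keep only the columns whose NA fraction is below the threshold
        self.columns_with_less_than_x_na_id = na_fraction[na_fraction < self.perc].index.tolist()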
        return self

    def transform(self, X, y=None, **kwargs):
        return X[self.columns_with_less_than_x_na_id]

    def get_params(self, deep=False):
        return {"perc": self.perc}
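To illustrate the intended behaviour, here is a condensed variant of the selector applied to a toy DataFrame. It is a sketch under assumptions: the shortened attribute name columns_ and the toy data are made up for the example, and the input is assumed to be a pandas DataFrame.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class NonNASelector(BaseEstimator, TransformerMixin):
    """Keep the columns whose NA fraction in the training data is below perc."""

    def __init__(self, perc=0.1):
        self.perc = perc

    def fit(self, X, y=None):
        na_fraction = X.isna().mean()  # per-column fraction of missing values
        self.columns_ = na_fraction[na_fraction < self.perc].index.tolist()
        return self

    def transform(self, X):
        return X[self.columns_]


# NA fractions in train: a = 0.0, b = 0.5, c = 0.25
train = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, 1.0, 2.0],
    "c": [1.0, np.nan, 3.0, 4.0],
})
test = pd.DataFrame({
    "a": [np.nan, np.nan],
    "b": [1.0, 2.0],
    "c": [5.0, 6.0],
})

selector = NonNASelector(perc=0.3).fit(train)
print(selector.columns_)
print(selector.transform(test))
```

The columns are chosen from the training data only, so the test frame keeps column a even though it is entirely missing there; handling those values is left to a later imputation step.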

Answer 5

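You can use the following custom transformer to select the specified columns:

# Custom transformer that extracts the columns passed as an argument to its constructor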


Here feature_names is the list of features to select. For more details, you can refer to this link: https://towardsdatascience.com/custom-transformers-and-ml-data-pipelines-with-python-20ea2a7adb65
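A minimal sketch of a transformer along these lines, assuming the input is a pandas DataFrame. Only the parameter name feature_names comes from the answer; the class name FeatureSelector and the toy data are assumptions made for the example.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class FeatureSelector(BaseEstimator, TransformerMixin):
    """Return only the DataFrame columns named in feature_names."""

    def __init__(self, feature_names):
        self.feature_names = feature_names

    def fit(self, X, y=None):
        return self  # stateless: nothing is learned from the data

    def transform(self, X):
        return X[self.feature_names]


df = pd.DataFrame({"f1": [1, 2], "f2": [3, 4], "f3": [5, 6]})
selected = FeatureSelector(feature_names=["f1", "f3"]).fit_transform(df)
print(selected)
```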

Answer 6

Another way is to simply use ColumnTransformer with an "empty" FunctionTransformer:

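from sklearn.compose import ColumnTransformer
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# a FunctionTransformer with func=None yields the identity function / passthrough
empty_func = make_pipeline(FunctionTransformer(func=None))

clf_all = make_pipeline(StandardScaler(),
                        # 0-based indices 2 and 3 are the third and fourth columns
                        ColumnTransformer([("select", empty_func, [2, 3])]),
                        GaussianNB(),
                        )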

This works because ColumnTransformer by default drops the remaining columns that aren't selected.
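The same effect can probably be had without the identity FunctionTransformer, since ColumnTransformer also accepts the string "passthrough" in place of a transformer. The sketch below assumes the iris data as a stand-in and 0-based indices 2 and 3 for the third and fourth columns.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)  # stand-in data (assumption)

# remainder defaults to "drop", so only the listed columns survive.
selector = ColumnTransformer([("select", "passthrough", [2, 3])])
X_selected = selector.fit_transform(X)

print(X_selected.shape)
```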

How to use a custom feature selection function in scikit-learn's Pipeline
