Error when setting the max_features parameter of sklearn's Isolation Forest algorithm

Asked: 2017-05-29 08:27:25

Tags: algorithm python-3.x scikit-learn random-forest

I am trying to train the scikit-learn IsolationForest implementation on a dataset with 357 features. When the max_features parameter is left at its default of 1.0, training succeeds and I get results.

However, when max_features is set to 2, the following error is raised:

ValueError: Number of features of the model must match the input. 
Model n_features is 2 and input n_features is 357


The same error also occurs when max_features is 1 (an integer) rather than 1.0 (a float).

My understanding is that when max_features is 2 (int), two features should be considered when building each tree. Is that wrong? How do I change the max_features parameter?

The code is as follows:

from sklearn.ensemble import IsolationForest

def isolation_forest_imp(dataset):

    estimators = 10
    samples = 100
    features = 2
    contamination = 0.1
    bootstrap = False
    random_state = None
    verbosity = 0

    # Note: the original call misspelled "bootstrap" as "boostrap", which
    # would raise a NameError before the max_features issue is even reached.
    estimator = IsolationForest(n_estimators=estimators, max_samples=samples,
                                contamination=contamination,
                                max_features=features, bootstrap=bootstrap,
                                random_state=random_state, verbose=verbosity)

    model = estimator.fit(dataset)
    return model

1 Answer:

Answer 0 (score: 0):

As stated in the documentation:

max_features : int or float, optional (default=1.0)
    The number of features to draw from X to train each base estimator.

        - If int, then draw `max_features` features.
        - If float, then draw `max_features * X.shape[1]` features.

So, as I understand it, 2 should mean drawing two features, 1.0 should mean drawing all of them, 0.5 half of them, and so on.
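That int-vs-float dispatch can be sketched as plain Python (the helper name `resolve_max_features` is mine, not sklearn's; it just mirrors the docstring rules above, with 357 features as in the question):

```python
# Sketch of how the docstring maps max_features to a feature count:
# an int is taken literally, a float is a fraction of all features.
def resolve_max_features(max_features, n_features):
    if isinstance(max_features, int):
        return max_features                 # draw exactly this many features
    return int(max_features * n_features)   # draw a fraction of all features

n_features = 357
print(resolve_max_features(1.0, n_features))  # 357 — the default, all features
print(resolve_max_features(2, n_features))    # 2
print(resolve_max_features(0.5, n_features))  # 178
```

This is why the question's error message reads "Model n_features is 2": each tree really was trained on two features, but prediction is then run against the full 357-column X.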

I think this may be a bug, because looking at IsolationForest's fit:

# Isolation Forest inherits from BaseBagging,
# and when _fit is called, BaseBagging handles the features correctly
super(IsolationForest, self)._fit(X, y, max_samples,
                                  max_depth=max_depth,
                                  sample_weight=sample_weight)
# however, after _fit, decision_function is called with X - the whole
# sample - without taking max_features into account
self.threshold_ = -sp.stats.scoreatpercentile(
    -self.decision_function(X), 100. * (1. - self.contamination))

Then:

    # when the decision function _validate_X_predict is called, with X unmodified, 
    # it calls the base estimator's (dt) _validate_X_predict with the whole X
    X = self.estimators_[0]._validate_X_predict(X, check_input=True)

   ... 

   # from tree.py:
   def _validate_X_predict(self, X, check_input):
        """Validate X whenever one tries to predict, apply, predict_proba"""
        if self.tree_ is None:
            raise NotFittedError("Estimator not fitted, "
                                 "call `fit` before exploiting the model.")

        if check_input:
            X = check_array(X, dtype=DTYPE, accept_sparse="csr")
            if issparse(X) and (X.indices.dtype != np.intc or
                                X.indptr.dtype != np.intc):
                raise ValueError("No support for np.int64 index based "
                                 "sparse matrices")
        # so, this check fails because X is the original X, not with the max_features applied
        n_features = X.shape[1]
        if self.n_features_ != n_features:
            raise ValueError("Number of features of the model must "
                             "match the input. Model n_features is %s and "
                             "input n_features is %s "
                             % (self.n_features_, n_features))

        return X

So I'm not sure how you can work around this. Perhaps compute the fraction that corresponds to the two features you need, although even then I'm not sure it would work as expected.

Note: I am using scikit-learn v0.18.

Edit: as @Vivek Kumar pointed out in the comments, this is a known issue, and upgrading to 0.20 should fix it.