I am trying to train an Isolation Forest (the sklearn implementation) on a dataset with 357 features. When the max_features parameter is set to 1.0 (the default), training succeeds and I get results.
However, when max_features is set to 2, it fails with the following error:
ValueError: Number of features of the model must match the input.
Model n_features is 2 and input n_features is 357
The same error also occurs when the feature count is 1 (an integer) rather than 1.0 (a float).
My understanding is that with a feature count of 2 (int), two features should be considered when building each tree. Is that wrong? How do I change the max_features parameter?
The code is below:
from sklearn.ensemble import IsolationForest  # public import path

def isolation_forest_imp(dataset):
    estimators = 10
    samples = 100
    features = 2
    contamination = 0.1
    bootstrap = False
    random_state = None
    verbosity = 0
    estimator = IsolationForest(n_estimators=estimators, max_samples=samples,
                                contamination=contamination,
                                max_features=features,
                                bootstrap=bootstrap,  # fixed typo: was "boostrap"
                                random_state=random_state, verbose=verbosity)
    model = estimator.fit(dataset)
    return model
Answer 0 (score: 0)
The documentation states:
max_features : int or float, optional (default=1.0)
The number of features to draw from X to train each base estimator.
- If int, then draw `max_features` features.
- If float, then draw `max_features * X.shape[1]` features.
So, as I understand it, 2 should mean drawing two features, 1.0 should mean drawing all the features, 0.5 half of them, and so on.
I think this may be a bug, because looking at IsolationForest's fit:
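For illustration, those semantics can be checked on a version where the bug is fixed (this sketch assumes scikit-learn >= 0.20; the data and parameter values are made up). BaseBagging records the feature indices drawn for each tree in estimators_features_:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = rng.randn(200, 10)  # 200 samples, 10 features

# max_features=2 (int): each tree draws 2 of the 10 features.
clf_int = IsolationForest(n_estimators=10, max_features=2,
                          random_state=0).fit(X)

# max_features=0.5 (float): each tree draws 0.5 * 10 = 5 features.
clf_float = IsolationForest(n_estimators=10, max_features=0.5,
                            random_state=0).fit(X)

print(len(clf_int.estimators_features_[0]))    # 2
print(len(clf_float.estimators_features_[0]))  # 5
```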
# Isolation Forest inherits from BaseBagging, and when _fit is called,
# BaseBagging handles the feature subsampling correctly
super(IsolationForest, self)._fit(X, y, max_samples,
                                  max_depth=max_depth,
                                  sample_weight=sample_weight)

# however, after _fit, decision_function is called with X -- the whole
# sample -- without taking max_features into account
self.threshold_ = -sp.stats.scoreatpercentile(
    -self.decision_function(X), 100. * (1. - self.contamination))
Then:
# when the decision function _validate_X_predict is called, with X unmodified,
# it calls the base estimator's (dt) _validate_X_predict with the whole X
X = self.estimators_[0]._validate_X_predict(X, check_input=True)
...
# from tree.py:
def _validate_X_predict(self, X, check_input):
    """Validate X whenever one tries to predict, apply, predict_proba"""
    if self.tree_ is None:
        raise NotFittedError("Estimator not fitted, "
                             "call `fit` before exploiting the model.")

    if check_input:
        X = check_array(X, dtype=DTYPE, accept_sparse="csr")
        if issparse(X) and (X.indices.dtype != np.intc or
                            X.indptr.dtype != np.intc):
            raise ValueError("No support for np.int64 index based "
                             "sparse matrices")

    # so, this check fails because X is the original X, not reduced to max_features columns
    n_features = X.shape[1]
    if self.n_features_ != n_features:
        raise ValueError("Number of features of the model must "
                         "match the input. Model n_features is %s and "
                         "input n_features is %s "
                         % (self.n_features_, n_features))

    return X
So I'm not sure how you can work around this. Perhaps work out the fraction that corresponds to the two features you need, although even then I'm not sure it would behave as expected.
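One possible workaround on the affected versions is to subsample the columns yourself and fit on the reduced matrix, so that fit and decision_function see the same number of features. This is only a sketch, and it is not equivalent to max_features: it draws a single feature subset shared by all trees, whereas max_features would draw a fresh subset per tree.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def isolation_forest_subset(dataset, n_features=2, random_state=None):
    """Fit an IsolationForest on a random subset of columns, so that
    fit and decision_function operate on the same feature count."""
    rng = np.random.RandomState(random_state)
    cols = rng.choice(dataset.shape[1], size=n_features, replace=False)
    model = IsolationForest(n_estimators=10, max_samples=100,
                            random_state=random_state)
    model.fit(dataset[:, cols])
    return model, cols

X = np.random.RandomState(0).randn(300, 357)
model, cols = isolation_forest_subset(X, n_features=2, random_state=0)
# At predict time, remember to select the same columns.
scores = model.decision_function(X[:, cols])
```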
Note: I am using scikit-learn v0.18.
EDIT: As @Vivek Kumar commented, this is a known issue, and upgrading to 0.20 should fix it.