How to predict with an sklearn pipeline and imputation

Time: 2015-01-08 07:00:32

Tags: python scikit-learn

I am working through the sklearn documentation page "Imputing missing values before building an estimator". The relevant code is:

import numpy as np

from sklearn.datasets import load_boston 
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer 
from sklearn.cross_validation import cross_val_score

rng = np.random.RandomState(0)

dataset = load_boston() 
X_full, y_full = dataset.data, dataset.target
n_samples = X_full.shape[0] 
n_features = X_full.shape[1]

estimator = RandomForestRegressor(random_state=0, n_estimators=100)
score = cross_val_score(estimator, X_full, y_full).mean() 
print("Score with the entire dataset = %.2f" % score)

# Add missing values in 75% of the lines
missing_rate = 0.75 
n_missing_samples = int(np.floor(n_samples * missing_rate))  # array sizes must be integers
missing_samples = np.hstack((np.zeros(n_samples - n_missing_samples,
                                      dtype=bool),
                             np.ones(n_missing_samples,
                                     dtype=bool)))
rng.shuffle(missing_samples)
missing_features = rng.randint(0, n_features, n_missing_samples)

# Estimate the score without the lines containing missing values
X_filtered = X_full[~missing_samples, :]
y_filtered = y_full[~missing_samples]
estimator = RandomForestRegressor(random_state=0, n_estimators=100)
score = cross_val_score(estimator, X_filtered, y_filtered).mean() 
print("Score without the samples containing missing values = %.2f" % score)

# Estimate the score after imputation of the missing values 
X_missing = X_full.copy()
X_missing[np.where(missing_samples)[0], missing_features] = 0 
y_missing = y_full.copy() 
estimator = Pipeline([("imputer", Imputer(missing_values=0,
                                          strategy="mean",
                                          axis=0)),
                      ("forest", RandomForestRegressor(random_state=0,
                                                       n_estimators=100))]) 
score = cross_val_score(estimator, X_missing, y_missing).mean() 
print("Score after imputation of the missing values = %.2f" % score)

Now I want to predict, which should be straightforward, but

estimator.predict(X=X_filtered[1:10,:])

returns the following error:

"AttributeError: 'Imputer' object has no attribute 'statistics_'"

What is the problem here?

1 Answer:

Answer 0: (score: 0)

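After a bit of research, I figured this out. The problem is that cross_val_score fits several models internally, each on a clone of the estimator, and none of them is kept for prediction, so the estimator object itself is never fitted. Prediction requires a fitted model. There are several ways to solve this; a simple one is: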
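The point about cross_val_score can be checked directly. The sketch below is illustrative rather than the original example: it assumes a recent scikit-learn, where SimpleImputer and sklearn.model_selection replace the Imputer and sklearn.cross_validation used in the question, and it uses a small synthetic dataset from make_regression instead of the Boston data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the dataset in the question (an assumption).
X, y = make_regression(n_samples=120, n_features=5, random_state=0)

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("forest", RandomForestRegressor(n_estimators=10, random_state=0)),
])

# cross_val_score fits one clone of `pipe` per fold and discards it.
scores = cross_val_score(pipe, X, y, cv=3)

# The imputer inside the original pipeline never saw any data, so it has
# no statistics_ attribute, which is what the error message complains about.
imputer_is_fitted = hasattr(pipe.named_steps["imputer"], "statistics_")
print("imputer fitted after cross_val_score:", imputer_is_fitted)
```

The scores come back as usual, but the pipeline object passed in is left exactly as it was constructed.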

# Fit the whole pipeline; the final step is a regressor, so fit() is the call needed here
estimator.fit(X_missing, y_missing)
estimator.predict(X=X_filtered[1:10,:])
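For readers on a recent scikit-learn, the same fix might look roughly like the sketch below. It is not the original example: SimpleImputer stands in for Imputer, the data is synthetic, and the way missing entries are chosen is a simplified assumption. Missing values are encoded as 0, as in the question.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

rng = np.random.RandomState(0)

# Synthetic stand-in for the dataset in the question (an assumption).
X, y = make_regression(n_samples=120, n_features=5, random_state=0)

# Set one randomly chosen feature to 0 ("missing") in 75% of the rows.
X_missing = X.copy()
n_missing = int(0.75 * X.shape[0])
rows = rng.choice(X.shape[0], size=n_missing, replace=False)
cols = rng.randint(0, X.shape[1], size=n_missing)
X_missing[rows, cols] = 0

pipe = Pipeline([
    ("imputer", SimpleImputer(missing_values=0, strategy="mean")),
    ("forest", RandomForestRegressor(n_estimators=10, random_state=0)),
])

# Fit the pipeline itself, then predict with it.
pipe.fit(X_missing, y)
predictions = pipe.predict(X[1:10, :])
print(predictions.shape)
```

Fitting once on all the data is separate from the cross-validation scores: the scores estimate how well the model generalizes, while the fitted pipeline is what actually serves predictions.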

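Since the original example uses cross-validation, another possible route is to use GridSearchCV and then take its best_estimator_, but that is beyond the scope of this question.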
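As a rough illustration of the GridSearchCV route, the sketch below searches over a small, hypothetical parameter grid and then predicts with best_estimator_. It assumes a recent scikit-learn (SimpleImputer, sklearn.model_selection) and synthetic data; the grid values are placeholders, not recommendations.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

rng = np.random.RandomState(0)

# Synthetic stand-in for the dataset in the question (an assumption).
X, y = make_regression(n_samples=120, n_features=5, random_state=0)

# Encode some entries as missing using 0, as the question does.
X_missing = X.copy()
rows = rng.choice(X.shape[0], size=90, replace=False)
cols = rng.randint(0, X.shape[1], size=90)
X_missing[rows, cols] = 0

pipe = Pipeline([
    ("imputer", SimpleImputer(missing_values=0, strategy="mean")),
    ("forest", RandomForestRegressor(random_state=0)),
])

# Hypothetical grid; pipeline step parameters use the "<step>__<param>" naming.
param_grid = {"forest__n_estimators": [10, 20]}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X_missing, y)

# With the default refit=True, best_estimator_ is refit on all the data,
# so it can be used for prediction directly.
best_pipe = search.best_estimator_
predictions = best_pipe.predict(X[:5])
print(search.best_params_, predictions.shape)
```

Unlike cross_val_score, the search object keeps a fitted model after cross-validation, so no separate fit call is needed before predicting.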