I am using a pipeline to perform feature selection and hyperparameter optimization with RandomizedSearchCV. Here is a summary of the code:
# Note: sklearn.cross_validation and sklearn.grid_search are the legacy module
# names; they were removed in scikit-learn 0.20 in favor of sklearn.model_selection.
from sklearn.cross_validation import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.grid_search import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from scipy.stats import randint as sp_randint

rng = 44
# `data` and `features` are defined elsewhere (not shown in this summary)
X_train, X_test, y_train, y_test = train_test_split(
    data[features], data['target'], random_state=rng)

clf = RandomForestClassifier(random_state=rng)
kbest = SelectKBest()
pipe = make_pipeline(kbest, clf)

upLim = X_train.shape[1]
# Integer division keeps the lower bound of k an int on Python 3 as well
param_dist = {'selectkbest__k': sp_randint(upLim // 2, upLim + 1),
              'randomforestclassifier__n_estimators': sp_randint(5, 150),
              'randomforestclassifier__max_depth': [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, None],
              'randomforestclassifier__criterion': ["gini", "entropy"],
              'randomforestclassifier__max_features': ['auto', 'sqrt', 'log2']}

clf_opt = RandomizedSearchCV(pipe, param_distributions=param_dist,
                             scoring='roc_auc', n_jobs=1, cv=3, random_state=rng)
clf_opt.fit(X_train, y_train)
y_pred = clf_opt.predict(X_test)
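I use a constant random_state for train_test_split, RandomForestClassifier, and RandomizedSearchCV. However, the results of the code above differ slightly if I run it multiple times. More specifically, my code has several unit tests, and these slightly different results cause the unit tests to fail. Since I use the same random_state everywhere, shouldn't I get the same results? Am I missing some source of randomness in my code?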
Answer 0 (score: 3)
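I often end up answering my own questions! I will leave this here for others with a similar problem:
To make sure I avoid any randomness, I set a global random seed instead of passing random_state to each component. The code is as follows: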
import numpy as np
from sklearn.cross_validation import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.grid_search import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from scipy.stats import randint as sp_randint

# Seed NumPy's global RNG once. np.random.seed() returns None,
# so there is nothing useful to assign from it.
np.random.seed(22)

# With random_state left unset, the components below draw from the global RNG
X_train, X_test, y_train, y_test = train_test_split(
    data[features], data['target'])

clf = RandomForestClassifier()
kbest = SelectKBest()
pipe = make_pipeline(kbest, clf)

upLim = X_train.shape[1]
# Integer division keeps the lower bound of k an int on Python 3 as well
param_dist = {'selectkbest__k': sp_randint(upLim // 2, upLim + 1),
              'randomforestclassifier__n_estimators': sp_randint(5, 150),
              'randomforestclassifier__max_depth': [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, None],
              'randomforestclassifier__criterion': ["gini", "entropy"],
              'randomforestclassifier__max_features': ['auto', 'sqrt', 'log2']}

clf_opt = RandomizedSearchCV(pipe, param_distributions=param_dist,
                             scoring='roc_auc', n_jobs=1, cv=3)
clf_opt.fit(X_train, y_train)
y_pred = clf_opt.predict(X_test)
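For a unit test, one way to check that the global seed really makes the search repeatable is to run it twice and compare the outcomes. The sketch below is only an illustration under several assumptions that are not in the original post: the helper name run_search is hypothetical, the data is a synthetic stand-in from make_classification, the parameter ranges are shrunk to keep it fast, and the imports use sklearn.model_selection, which replaced the legacy modules above. A likely reason the global seed matters in the legacy code is that scipy.stats distributions such as sp_randint draw from NumPy's global RNG unless a random_state is handed to them.

```python
import numpy as np
from scipy.stats import randint as sp_randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline


def run_search(seed=22):
    """Seed the global NumPy RNG, then run a small randomized search.

    Hypothetical helper for illustration; not part of the original post.
    """
    np.random.seed(seed)  # resets the global RNG state; returns None

    # Synthetic stand-in for data[features] / data['target'] (assumption).
    X, y = make_classification(n_samples=200, n_features=10, random_state=0)

    # No random_state anywhere below: everything draws from the global RNG.
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    pipe = make_pipeline(SelectKBest(), RandomForestClassifier())

    up_lim = X_train.shape[1]
    param_dist = {
        "selectkbest__k": sp_randint(up_lim // 2, up_lim + 1),
        "randomforestclassifier__n_estimators": sp_randint(5, 30),
        "randomforestclassifier__max_depth": [5, 10, None],
    }
    search = RandomizedSearchCV(
        pipe, param_distributions=param_dist, n_iter=4,
        scoring="roc_auc", n_jobs=1, cv=3,
    )
    search.fit(X_train, y_train)
    # Return the chosen parameters and the positive-class probabilities.
    return search.best_params_, search.predict_proba(X_test)[:, 1]


# Two runs with the same seed should agree on parameters and predictions.
params_a, proba_a = run_search()
params_b, proba_b = run_search()
print(params_a == params_b, np.array_equal(proba_a, proba_b))
```

Note that a global seed only fixes the sequence of draws, so adding or reordering any call that consumes random numbers before the search will change the result. In recent scikit-learn releases, passing random_state explicitly to each component is an alternative that avoids this coupling.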
I hope this helps someone!