How can I speed up nested cross-validation in Python?

Asked: 2019-04-23 09:49:07

Tags: python parallel-processing scikit-learn dask cross-validation

From what I've found, there is one other question like this (Speed-up nested cross-validation), but installing MPI did not work for me either, after trying several fixes suggested on that site and on Microsoft's site, so I'm hoping there is another package, or an answer to this question.

I am looking to compare multiple algorithms and grid search a wide variety of parameters (perhaps too many parameters?). Is there any way, besides mpi4py, to speed up the running of my code? As I understand it, I cannot use n_jobs=-1, as that is then not nested?

Also to note: I am not able to run this with many of the parameters I am trying to look at below (it runs longer than the time I have). I only get results after 2 hours if I give each model just 2 parameters to compare. Additionally, I run this code on a dataset of 252 rows and 25 feature columns with 4 categorical target values, to predict ('certain', 'likely', 'possible' or 'unlikely') whether a gene (out of 252 genes) affects a disease. Using SMOTE increases the sample size to 420, which is then what goes into use.

dataset= pd.read_csv('data.csv')
data = dataset.drop(["gene"],1)
df = data.iloc[:,0:24]
df = df.fillna(0)
X = MinMaxScaler().fit_transform(df)

le = preprocessing.LabelEncoder()
encoded_value = le.fit_transform(["certain", "likely", "possible", "unlikely"])
Y = le.fit_transform(data["category"])

sm = SMOTE(random_state=100)
X_res, y_res = sm.fit_resample(X, Y)

seed = 7
logreg = LogisticRegression(penalty='l1', solver='liblinear',multi_class='auto')
LR_par= {'penalty':['l1'], 'C': [0.5, 1, 5, 10], 'max_iter':[500, 1000, 5000]}

rfc =RandomForestClassifier()
param_grid = {'bootstrap': [True, False],
              'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
              'max_features': ['auto', 'sqrt'],
              'min_samples_leaf': [1, 2, 4,25],
              'min_samples_split': [2, 5, 10, 25],
              'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}

mlp = MLPClassifier(random_state=seed)
parameter_space = {'hidden_layer_sizes': [(10,20), (10,20,10), (50,)],
     'activation': ['tanh', 'relu'],
     'solver': ['adam', 'sgd'],
     'max_iter': [10000],
     'alpha': [0.1, 0.01, 0.001],
     'learning_rate': ['constant','adaptive']}

gbm = GradientBoostingClassifier(min_samples_split=25, min_samples_leaf=25)
param = {"loss":["deviance"],
    "learning_rate": [0.15,0.1,0.05,0.01,0.005,0.001],
    "min_samples_split": [2, 5, 10, 25],
    "min_samples_leaf": [1, 2, 4,25],
    "max_depth":[10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
    "max_features":['auto', 'sqrt'],
    "criterion": ["friedman_mse"],
    "n_estimators":[200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]
    }

svm = SVC(gamma="scale", probability=True)
tuned_parameters = {'kernel':('linear', 'rbf'), 'C':(1,0.25,0.5,0.75)}

def baseline_model(optimizer='adam', learn_rate=0.01):
    model = Sequential()
    model.add(Dense(100, input_dim=X_res.shape[1], activation='relu')) 
    model.add(Dropout(0.5))
    model.add(Dense(50, activation='relu')) # 50 hidden units in this layer
    model.add(Dense(4, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    return model

keras = KerasClassifier(build_fn=baseline_model, batch_size=32, epochs=100, verbose=0)
learn_rate = [0.001, 0.01, 0.1, 0.2, 0.3]
optimizer = ['SGD', 'RMSprop', 'Adagrad', 'Adadelta', 'Adam', 'Adamax', 'Nadam']
kerasparams = dict(optimizer=optimizer, learn_rate=learn_rate)

inner_cv = KFold(n_splits=10, shuffle=True, random_state=seed)
outer_cv = KFold(n_splits=10, shuffle=True, random_state=seed)

models = []
models.append(('GBM', GridSearchCV(gbm, param, cv=inner_cv,iid=False, n_jobs=1)))
models.append(('RFC', GridSearchCV(rfc, param_grid, cv=inner_cv,iid=False, n_jobs=1)))
models.append(('LR', GridSearchCV(logreg, LR_par, cv=inner_cv, iid=False, n_jobs=1)))
models.append(('SVM', GridSearchCV(svm, tuned_parameters, cv=inner_cv, iid=False, n_jobs=1)))
models.append(('MLP', GridSearchCV(mlp, parameter_space, cv=inner_cv,iid=False, n_jobs=1)))
models.append(('Keras', GridSearchCV(estimator=keras, param_grid=kerasparams, cv=inner_cv,iid=False, n_jobs=1)))


results = []
names = []
scoring = 'accuracy'
X_train, X_test, Y_train, Y_test = train_test_split(X_res, y_res, test_size=0.2, random_state=0)


for name, model in models:
    nested_cv_results = model_selection.cross_val_score(model, X_res, y_res, cv=outer_cv, scoring=scoring)
    results.append(nested_cv_results)
    names.append(name)
    msg = "Nested CV Accuracy %s: %f (+/- %f )" % (name, nested_cv_results.mean()*100, nested_cv_results.std()*100)
    print(msg)
    model.fit(X_train, Y_train)
    print('Test set accuracy: {:.2f}'.format(model.score(X_test, Y_test)*100),  '%')
    print("Best Parameters: \n{}\n".format(model.best_params_))
    print("Best CV Score: \n{}\n".format(model.best_score_))

As an example, most of the dataset is binary, and looks like this:

gene   Tissue    Druggable Eigenvalue CADDvalue Catalogpresence   Category
ACE      1           1         1          0           1            Certain
ABO      1           0         0          0           0            Likely
TP53     1           1         0          0           0            Possible

Any guidance on how to speed this up would be appreciated.

Edit: I have also tried using dask for parallel processing, but I am not sure I am doing it correctly, and it doesn't seem to run any faster:

for name, model in models:
    with joblib.parallel_backend('dask'):
        nested_cv_results = model_selection.cross_val_score(model, X_res, y_res, cv=outer_cv, scoring=scoring)
        results.append(nested_cv_results)
        names.append(name)
        msg = "Nested CV Accuracy %s: %f (+/- %f )" % (name, nested_cv_results.mean()*100, nested_cv_results.std()*100)
        print(msg)
        model.fit(X_train, Y_train)
        print('Test set accuracy: {:.2f}'.format(model.score(X_test, Y_test)*100),  '%')
        #print("Best Estimator: \n{}\n".format(model.best_estimator_))
        print("Best Parameters: \n{}\n".format(model.best_params_))
        print("Best CV Score: \n{}\n".format(model.best_score_)) #average of all cv folds for a single combination of the parameters you specify 

Edit: Also to note, I have tried reducing the grid search, for example using 5 parameters per model, but this still takes several hours to complete, so whilst trimming down the number of parameters will help, if there is any advice for efficiency beyond that I would be grateful.
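For a sense of how big the searches above are, it can help to count the implied number of model fits before launching anything. This quick tally uses the random-forest grid from the code above:

```python
# Count the fits implied by the RandomForestClassifier grid in the question.
rfc_grid = {'bootstrap': [True, False],
            'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
            'max_features': ['auto', 'sqrt'],
            'min_samples_leaf': [1, 2, 4, 25],
            'min_samples_split': [2, 5, 10, 25],
            'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}

n_combos = 1
for values in rfc_grid.values():
    n_combos *= len(values)

inner_folds, outer_folds = 10, 10  # the KFold settings in the question

print(n_combos)                              # 7040 parameter combinations
print(n_combos * inner_folds)                # 70400 fits per outer fold
print(n_combos * inner_folds * outer_folds)  # 704000 RFC fits in the nested CV
```

At roughly 700k random-forest fits for this one model alone, trimming the grid (or switching to a randomized/adaptive search) matters at least as much as parallelism.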

4 Answers:

Answer 0 (score: 3)

Dask-ML has scalable implementations of GridSearchCV and RandomizedSearchCV that are, I believe, drop-in replacements for the Scikit-Learn ones. They were developed alongside the Scikit-Learn developers.

They can be faster for two reasons:
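A minimal sketch of the drop-in swap (the dataset, estimator, and grid here are illustrative, not taken from the question; the dask_ml alternative is shown as a comment so the snippet runs with scikit-learn alone):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
# Drop-in replacement: swap the import above for
#   from dask_ml.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {'kernel': ['linear', 'rbf'], 'C': [0.25, 0.5, 1.0]}

search = GridSearchCV(SVC(gamma='scale'), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Because the interface matches, the rest of the nested-CV code (cross_val_score over the fitted searcher) does not need to change.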

Answer 1 (score: 2)

Two things:

  1. Instead of GridSearch, try using HyperOpt: it is a Python library for serial and parallel optimization.

  2. I would reduce the dimensionality by using UMAP or PCA. UMAP is probably the better choice.

After applying SMOTE:

import umap

dim_reduced = umap.UMAP(
    min_dist=min_dist,
    n_neighbors=neighbours,
    random_state=1234,
).fit_transform(smote_output)

Then you can use dim_reduced for the train/test split.

Reducing the dimensionality will help remove noise from the data, and instead of dealing with 25 features you will bring them down to 2 (using UMAP) or to the number of components you choose (using PCA). This should have a significant impact on performance.

Answer 2 (score: 1)

There is an easy win in your situation, and that is .... start using parallel processing :). dask will help you if you have a cluster (it will work on a single machine, but the improvement over the default scheduling in sklearn is not significant), but if you plan to run it on a single machine (with several cores/threads and "enough" memory), then you can run the nested CV in parallel. The only trick is that sklearn will not allow you to run the outer CV loop in multiple processes. However, it will allow you to run the inner loop in multiple threads.

At the moment you have n_jobs=None in the outer CV loop (that is the default in cross_val_score), which means n_jobs=1, and that is the only option you can use with sklearn in a nested CV.

However, you can achieve an easy gain by setting n_jobs=some_reasonable_number in all the GridSearchCV instances you use. some_reasonable_number does not have to be -1 (but that is a good starting point). Some algorithms either plateau at n_jobs=n_cores instead of n_threads (xgboost, for example) or already have built-in multiprocessing, and may suffer from clashes if you spawn too many processes.
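As a minimal sketch of this suggestion (the model and grid are cut down from the question; n_jobs=-1 plays the role of some_reasonable_number here):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

X, y = load_iris(return_X_y=True)
inner_cv = KFold(n_splits=5, shuffle=True, random_state=7)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=7)

logreg = LogisticRegression(penalty='l1', solver='liblinear')
LR_par = {'C': [0.5, 1, 5, 10]}

# Inner loop: parallelize the grid search across all cores.
clf = GridSearchCV(logreg, LR_par, cv=inner_cv, n_jobs=-1)

# Outer loop: leave n_jobs at its default of 1, as described above.
scores = cross_val_score(clf, X, y, cv=outer_cv)
print(scores.mean())
```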

Answer 3 (score: 1)

IIUC, you are trying to parallelize this example from the sklearn docs. If so, then here is one possible approach to address

  • why dask is not working
  • any sort of constructive guidance or further knowledge on this problem

General imports

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn import preprocessing
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score, GridSearchCV, KFold, train_test_split
from sklearn.neural_network import MLPClassifier
import dask_ml.model_selection as dcv


import time

Data

  • I defined 3 datasets to try implementing dask_ml on
    • the size (# of rows) of the third one (dataset 3) is adjustable and can be increased arbitrarily, depending on your computing power
      • I timed the dask_ml execution using this dataset only
    • the code below works for all 3 datasets
    • dataset 1 is a slightly longer version of the sample data in the SO question
#### Dataset 1 - longer version of data in the question
d = """gene Tissue Druggable Eigenvalue CADDvalue Catalogpresence Category
ACE 1 1 1 0 1 Certain
ABO 1 0 0 0 0 Likely
TP53 1 1 0 0 0 Possible"""
data = pd.DataFrame([x.split(' ') for x in d.split('\n')])
data.columns = data.loc[0,:]
data.drop(0, axis=0, inplace=True)
data = pd.concat([data]*15)

data = data.drop(["gene"],1)
df = data.iloc[:,0:5]

X = MinMaxScaler().fit_transform(df)
le = preprocessing.LabelEncoder()
encoded_value = le.fit_transform(["Certain", "Likely", "Possible"])
Y = le.fit_transform(data["Category"])

sm = SMOTE(random_state=100)
X_res, y_res = sm.fit_resample(X, Y)
#### Dataset 2 - iris dataset from example in sklearn nested cross validation docs
# Load the dataset
from sklearn.datasets import load_iris
iris = load_iris()
X_res = iris.data
y_res = iris.target
#### Dataset 3 - size (#rows, #columns) is adjustable (I used this to time code execution)
X_res = pd.DataFrame(np.random.rand(300,50), columns=['col_'+str(c+1) for c in list(range(50))])
from random import shuffle
cats = ["paris", "barcelona", "kolkata", "new york", 'sydney']
y_values = cats*int(len(X_res)/len(cats))
shuffle(y_values)
y_res = pd.Series(y_values)

Instantiate classifiers - no change from the code in the question

seed = 7
logreg = LogisticRegression(penalty='l1', solver='liblinear',multi_class='auto')
LR_par= {'penalty':['l1'], 'C': [0.5, 1, 5, 10], 'max_iter':[500, 1000, 5000]}

mlp = MLPClassifier(random_state=seed)
parameter_space = {'hidden_layer_sizes': [(10,20), (10,20,10), (50,)],
     'activation': ['tanh', 'relu'],
     'solver': ['adam', 'sgd'],
     'max_iter': [10000],
     'alpha': [0.1, 0.01, 0.001],
     'learning_rate': ['constant','adaptive']}

rfc =RandomForestClassifier()
param_grid = {'bootstrap': [True, False],
              'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
              'max_features': ['auto', 'sqrt'],
              'min_samples_leaf': [1, 2, 4,25],
              'min_samples_split': [2, 5, 10, 25],
              'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}

gbm = GradientBoostingClassifier(min_samples_split=25, min_samples_leaf=25)
param = {"loss":["deviance"],
    "learning_rate": [0.15,0.1,0.05,0.01,0.005,0.001],
    "min_samples_split": [2, 5, 10, 25],
    "min_samples_leaf": [1, 2, 4,25],
    "max_depth":[10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
    "max_features":['auto', 'sqrt'],
    "criterion": ["friedman_mse"],
    "n_estimators":[200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]
    }

svm = SVC(gamma="scale", probability=True)
tuned_parameters = {'kernel':('linear', 'rbf'), 'C':(1,0.25,0.5,0.75)}
inner_cv = KFold(n_splits=10, shuffle=True, random_state=seed)
outer_cv = KFold(n_splits=10, shuffle=True, random_state=seed)

Implementation using dask_ml's GridSearchCV (as originally suggested by @MRocklin here) - see the dask_ml docs for dask_ml.model_selection.GridSearchCV

  • for brevity I have excluded KerasClassifier and the helper function baseline_model(), but my approach to handling the former would be the same as for the others
models = []
models.append(('MLP', dcv.GridSearchCV(mlp, parameter_space, cv=inner_cv,iid=False, n_jobs=1)))
models.append(('GBM', dcv.GridSearchCV(gbm, param, cv=inner_cv,iid=False, n_jobs=1)))
models.append(('RFC', dcv.GridSearchCV(rfc, param_grid, cv=inner_cv,iid=False, n_jobs=1)))
models.append(('LR', dcv.GridSearchCV(logreg, LR_par, cv=inner_cv, iid=False, n_jobs=1)))
models.append(('SVM', dcv.GridSearchCV(svm, tuned_parameters, cv=inner_cv, iid=False, n_jobs=1)))

Initialize an additional blank list to hold the non-nested CV results

non_nested_results = []
nested_results = []
names = []
scoring = 'accuracy'
X_train, X_test, Y_train, Y_test = train_test_split(X_res, y_res, test_size=0.2, random_state=0)

Joblib and dask client setup

# Create a local cluster
from dask.distributed import Client
client = Client(processes=False, threads_per_worker=4,
        n_workers=1, memory_limit='6GB')
from sklearn.externals import joblib

Perform nested CV, per the sklearn docs example

  • GridSearchCV is executed first
  • cross_val_score is executed second
  • note that, for demonstration purposes, I only used 1 sklearn model (SVC) from the list of models in the question's example code
start = time.time()
for name, model in [models[-1]]:
  # Non_nested parameter search and scoring
  with joblib.parallel_backend('dask'):
    model.fit(X_train, Y_train)
  non_nested_results.append(model.best_score_)

  # Nested CV with parameter optimization
  nested_score = cross_val_score(model, X=X_train, y=Y_train, cv=outer_cv)
  nested_results.append(nested_score.mean())

  names.append(name)
  msg = "Nested CV Accuracy %s: %f (+/- %f )" %\
        (name, np.mean(nested_results)*100, np.std(nested_results)*100)
  print(msg)
  print('Test set accuracy: {:.2f}'.format(model.score(X_test, Y_test)*100),  '%')
  print("Best Estimator: \n{}\n".format(model.best_estimator_))
  print("Best Parameters: \n{}\n".format(model.best_params_))
  print("Best CV Score: \n{}\n".format(model.best_score_))

score_difference = [a_i - b_i for a_i, b_i in zip(non_nested_results, nested_results)]
print("Average difference of {0:6f} with std. dev. of {1:6f}."
      .format(np.mean(score_difference), np.std(score_difference)))

print('Total running time of the script: {:.2f} seconds' .format(time.time()-start))

client.close()

Below is the output (with script execution times), using dataset 3

Output + Timing without dask 1
Nested CV Accuracy SVM: 20.416667 (+/- 0.000000 )
Test set accuracy: 16.67 %
Best Estimator: 
SVC(C=0.75, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='scale', kernel='linear',
  max_iter=-1, probability=True, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

Best Parameters: 
{'C': 0.75, 'kernel': 'linear'}

Best CV Score: 
0.2375

Average difference of 0.033333 with std. dev. of 0.000000.
Total running time of the script: 23.96 seconds

Output + Timing with dask, using n_workers=1 and threads_per_worker=4 2

Nested CV Accuracy SVM: 18.750000 (+/- 0.000000 )
Test set accuracy: 13.33 %
Best Estimator: 
SVC(C=0.5, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
  max_iter=-1, probability=True, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

Best Parameters: 
{'C': 0.5, 'kernel': 'rbf'}

Best CV Score: 
0.1916666666666667

Average difference of 0.004167 with std. dev. of 0.000000.
Total running time of the script: 8.84 seconds

Output + Timing with dask, using n_workers=4 and threads_per_worker=4 2

Nested CV Accuracy SVM: 23.333333 (+/- 0.000000 )
Test set accuracy: 21.67 %
Best Estimator: 
SVC(C=0.25, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='scale', kernel='linear',
  max_iter=-1, probability=True, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

Best Parameters: 
{'C': 0.25, 'kernel': 'linear'}

Best CV Score: 
0.25

Average difference of 0.016667 with std. dev. of 0.000000.
Total running time of the script: 7.52 seconds

Output + Timing with dask, using n_workers=1 and threads_per_worker=8 2

Nested CV Accuracy SVM: 20.416667 (+/- 0.000000 )
Test set accuracy: 18.33 %
Best Estimator: 
SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
  max_iter=-1, probability=True, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

Best Parameters: 
{'C': 1, 'kernel': 'rbf'}

Best CV Score: 
0.23333333333333334

Average difference of 0.029167 with std. dev. of 0.000000.
Total running time of the script: 7.06 seconds

1 uses sklearn.model_selection.GridSearchCV() and does not use joblib()

2 uses dask_ml.model_selection.GridSearchCV() in place of sklearn.model_selection.GridSearchCV(), together with joblib()

Comments about the code and output in this answer

  • I noticed in your question that you reversed the order of GridSearchCV and cross_val_score, compared to the example in the docs
    • not sure if this affects your question much, but I thought I would mention it
  • I have no experience with nested cross-validation, so I cannot comment on whether Client(..., n_workers=n, threads_per_worker=m), with n>1 and/or m=4 or m=8, is acceptable or incorrect

General comments about dask_ml usage (as I understand it)

  • Case 1: if the training data is small enough to fit into memory on a single machine, but the test dataset does not fit into memory, you can use the ParallelPostFit wrapper
    • read the test data onto the cluster in parallel
    • make predictions on the test data in parallel, using all the workers on the cluster
    • IIUC, this case is not relevant to your question
  • Case 2: if you would like to use joblib to train a large scikit-learn model on a cluster (but the train/test data fits into memory) - aka distributed scikit-learn - then you can use the cluster for training, and the skeleton code (per the dask_ml docs) looks like what is shown above
    • IIUC, this case is
      • relevant to your question
      • the approach I used in this answer

System details (used to execute the code)

dask==1.2.0
dask-ml==0.12.0
numpy==1.16.2+mkl
pandas==0.24.0
scikit-learn==0.20.3
sklearn==0.0
OS==Windows 8 (64-bit)
Python version (import platform; print(platform.python_version()))==3.7.2