Using dask_ml GridSearchCV on an HPC cluster client

Time: 2020-02-07 14:46:04

Tags: python pandas joblib gridsearchcv dask-ml

I am working with a large dataset and am trying to tune my model's hyperparameters with the GridSearchCV from dask_ml, as described in the dask_ml tutorial.

Here is my Python code:

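import pandas as pd
import joblib  # replaces the deprecated sklearn.externals.joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold
from dask_ml.model_selection import GridSearchCV
from dask.distributed import Client, progress
from dask_jobqueue import PBSCluster

# Load features, labels and groups from the database (queries elided)
df_features = pd.read_sql_query(...)
df_labels = pd.read_sql_query(...)
df_groups = pd.read_sql_query(...)

# features_values, labels_values, groups, cv_folds and cv_parameters
# are defined elsewhere (not shown here)
# Precompute the group-aware (train, test) index pairs as a list
splitter = list(GroupKFold(n_splits=cv_folds).split(features_values,
                                                    labels_values,
                                                    groups))
clf = RandomForestClassifier()
clf = GridSearchCV(clf,
                   cv_parameters,
                   cv=splitter,
                   return_train_score=True)

# Each PBS job provides 12 cores and 60GB; scale between 5 and 10 jobs
cluster = PBSCluster(cores=12,
                     memory="60GB")
cluster.adapt(minimum_jobs=5, maximum_jobs=10)
client = Client(cluster)
with joblib.parallel_backend('dask'):
    clf.fit(features_values, labels_values)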

Dask then asks me to use client.scatter to place the data on the workers, as follows:

(_CVIterableWrapper(cv=[(array([     3,      4, .. ... e, False, True)
Consider scattering large objects ahead of time
with client.scatter to reduce scheduler burden and 
keep data on workers

    future = client.submit(func, big_data)    # bad

    big_future = client.scatter(big_data)     # good
    future = client.submit(func, big_future)  # good
  % (format_bytes(len(b)), s)
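The truncated repr in this warning starts with _CVIterableWrapper(cv=[(array([..., which suggests that the precomputed split index arrays are at least part of the large object being shipped. A minimal sketch for estimating how large such a materialized split list is, on synthetic data; the sizes and the group layout below are assumptions for illustration, not values from my dataset:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical sizes, chosen only for illustration
n_samples, n_splits = 10_000, 5
X = np.zeros((n_samples, 1))
y = np.zeros(n_samples)
groups = np.arange(n_samples) % 50  # 50 synthetic groups

# Materialize the folds the same way the question's code does
splits = list(GroupKFold(n_splits=n_splits).split(X, y, groups))

# Every fold stores one index per sample (train indices plus test indices),
# so the list holds n_splits * n_samples integers in total
split_bytes = sum(train.nbytes + test.nbytes for train, test in splits)
print(f"{split_bytes / 1e6:.1f} MB of indices in {n_splits} folds")
```

If the split list turns out to dominate, one option to try is passing the GroupKFold object itself as cv and the group labels through fit(..., groups=groups), assuming dask_ml's fit accepts groups the way scikit-learn's does. Whether that removes the warning is untested here.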

However, if I use the backend with the scatter argument, as in the tutorial:

with joblib.parallel_backend('dask', scatter=[features_values, labels_values]):
    clf.fit(features_values, labels_values)

then no workers are found:

distributed.core - ERROR - No workers found
Traceback (most recent call last):
  File "/.../lib/python3.6/site-packages/distributed/core.py", line 412, in handle_comm
    result = await result
  File "/.../lib/python3.6/site-packages/distributed/scheduler.py", line 2703, in scatter
    raise gen.TimeoutError("No workers found")
tornado.util.TimeoutError: No workers found
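One possible reading of this traceback, offered as an assumption rather than a confirmed diagnosis: cluster.adapt only requests PBS jobs, so the scatter may run while those jobs are still queued and no worker has joined the scheduler yet. A minimal sketch of waiting for workers before scattering, using Client.wait_for_workers and Client.scatter; the helper name wait_then_scatter and its default values are hypothetical:

```python
def wait_then_scatter(client, objects, n_workers=1, timeout=600):
    """Wait for workers to join, then scatter `objects` to them.

    `client` is expected to be a dask.distributed.Client; the helper name
    and the default values are illustrative assumptions.
    """
    # Block until at least n_workers have connected or the timeout elapses
    client.wait_for_workers(n_workers, timeout=timeout)
    # broadcast=True copies each object to all workers known at this point
    return client.scatter(objects, broadcast=True)
```

With the objects from my code this would be called as wait_then_scatter(client, [features_values, labels_values], n_workers=1). Since the joblib backend performs its own scatter when given scatter=[...], calling client.wait_for_workers(1) just before the with block would be the equivalent step there.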

Any suggestions would be welcome.

Note: with a smaller dataset, everything works fine.
