I am working with a large dataset, and I am trying to tune my model's hyperparameters with GridSearchCV from dask_ml, as introduced in the dask_ml tutorial.
Here is my Python code:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold
from dask_ml.model_selection import GridSearchCV
from dask.distributed import Client, progress
from dask_jobqueue import PBSCluster
from sklearn.externals import joblib

df_features = pd.read_sql_query(...)
df_labels = pd.read_sql_query(...)
df_groups = pd.read_sql_query(...)

# features_values, labels_values, groups, cv_folds and cv_parameters
# are derived from the dataframes above (omitted here)
splitter = list(GroupKFold(n_splits=cv_folds).split(features_values,
                                                    labels_values,
                                                    groups))

clf = RandomForestClassifier()
clf = GridSearchCV(clf,
                   cv_parameters,
                   cv=splitter,
                   return_train_score=True)

cluster = PBSCluster(cores=12,
                     memory="60GB")
cluster.adapt(minimum_jobs=5, maximum_jobs=10)
client = Client(cluster)

with joblib.parallel_backend('dask'):
    clf.fit(features_values, labels_values)
dask then asks me to deploy the data onto the workers with client.scatter, as follows:
(_CVIterableWrapper(cv=[(array([ 3, 4, .. ... e, False, True)
Consider scattering large objects ahead of time
with client.scatter to reduce scheduler burden and
keep data on workers
future = client.submit(func, big_data) # bad
big_future = client.scatter(big_data) # good
future = client.submit(func, big_future) # good
% (format_bytes(len(b)), s)
However, if I use the backend with scatter, as done in the tutorial:
with joblib.parallel_backend('dask', scatter=[features_values, labels_values]):
    clf.fit(features_values, labels_values)
then no workers are found:
distributed.core - ERROR - No workers found
Traceback (most recent call last):
File "/.../lib/python3.6/site-packages/distributed/core.py", line 412, in handle_comm
result = await result
File "/.../lib/python3.6/site-packages/distributed/scheduler.py", line 2703, in scatter
raise gen.TimeoutError("No workers found")
tornado.util.TimeoutError: No workers found
Any suggestions would be welcome.
Note: everything works fine if I use a smaller dataset.
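What I can reproduce in isolation is the scatter pattern itself. A minimal sketch, assuming the timeout comes from scattering before any adaptive PBS jobs have actually started (LocalCluster stands in for PBSCluster here so the snippet runs without a PBS queue):

```python
from dask.distributed import Client, LocalCluster

# LocalCluster stands in for PBSCluster so this runs without a PBS queue.
cluster = LocalCluster(n_workers=2, processes=False)
client = Client(cluster)

# Block until at least one worker has registered with the scheduler;
# with an adaptive PBSCluster this would wait for queued jobs to start.
client.wait_for_workers(n_workers=1)

big_data = b"x" * 1_000_000  # placeholder for the real feature arrays
big_future = client.scatter(big_data)  # succeeds once workers exist
result = client.gather(big_future)

client.close()
cluster.close()
```

With this local setup, scatter succeeds, which is what makes me suspect the adaptive PBS jobs are not up yet when the scatter in `parallel_backend` runs.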