I'm using IPython's parallel features:
from ipyparallel import Client  # IPython.parallel in older IPython versions

client = Client()
client.direct_view().use_dill()

def hostname():
    import socket
    return socket.gethostname()

# For joblib, we only want one engine per host.
joblib_clients = dict(zip(map(lambda x: client[x].apply_sync(hostname), client.ids),
                          client.ids))
lview_joblib = client.load_balanced_view(targets=list(joblib_clients.values()))
dview = client.direct_view()
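The `dict(zip(...))` above keys the mapping by hostname, so when several engines share a host, later engine ids overwrite earlier ones and exactly one survives per host. A minimal, cluster-free sketch of that dedup logic (the hostnames and ids here are made up):

```python
def one_engine_per_host(hostnames, ids):
    """Map hostname -> engine id, keeping the last engine seen per host."""
    return dict(zip(hostnames, ids))

hostnames = ["node1", "node1", "node2", "node2"]
ids = [0, 1, 2, 3]
per_host = one_engine_per_host(hostnames, ids)

# Engine ids 1 and 3 survive, one per host.
print(sorted(per_host.values()))  # [1, 3]
```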
I also read data from disk, both locally and on each engine:
%%px --local
import re
import pandas as pd

store = pd.HDFStore(data_file, 'r')
rows = store.select('results', ['cv_score_mean > 0'])
rows = rows.sort_values('cv_score_mean', ascending=False)  # .sort() in older pandas
rows['results_index'] = rows.index
data_model = store.select('data_model')
p = re.compile('_coef$')
feature_set = (rows.filter(regex='_coef$')
                   .dropna(axis=1, how='all')
                   .rename(columns=lambda x: p.sub('', x))
                   .columns)
Then I define and instantiate a class on all of the engines (but not locally):
%%px
class DataRegression(object):
    def __init__(self, **kwargs):
        ...

    def regress_z_scores(self, **kwargs):
        ...

regressions = DataRegression(data_model, feature_set)
Through the load-balanced view, I can invoke this function in several ways:

1) Via a lambda function:
# What exactly is this lambda function passing to the lview?
ar = lview_joblib.map(lambda run: regressions.regress_z_scores(run), runs)
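The lambda closes over the local name `regressions`, so dill serializes the function together with the object it captures, and each engine receives a snapshot of it by value. A minimal, cluster-free sketch of that capture-by-value behavior (`Regressor` is a hypothetical stand-in for `DataRegression`; plain `pickle` suffices here because a bound method of a top-level class is picklable in Python 3):

```python
import pickle

class Regressor:
    """Hypothetical stand-in for DataRegression."""
    def __init__(self, offset):
        self.offset = offset

    def regress(self, run):
        return run + self.offset

reg = Regressor(offset=10)

# Serializing the bound method captures the instance *by value*:
payload = pickle.dumps(reg.regress)
restored = pickle.loads(payload)

# Later mutations of the local object do not reach the copy:
reg.offset = 99
print(restored(1))  # 11: the deserialized copy still has offset=10
```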
2) By trying to call the function directly:
# This fails with a NameError, because regressions.regress_z_scores is not defined
ar = lview_joblib.map(regressions.regress_z_scores, runs)
3) By also creating the regressions object locally:
%%px --local
class DataRegression(object):
    def __init__(self, **kwargs):
        ...

    def regress_z_scores(self, **kwargs):
        ...

regressions = DataRegression(data_model, feature_set)
# And invoking it through the name.
ar = lview_joblib.map(regressions.regress_z_scores, runs)
# Does this mean that the local object gets pickled and passed to the client each time?
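That is essentially what happens: mapping a bound method pickles the whole instance into every submission, so the task payload grows with the object's state. A cluster-free sketch of the effect (`Model` is a hypothetical stand-in carrying sizeable state):

```python
import pickle

class Model:
    """Hypothetical stand-in for DataRegression with sizeable state."""
    def __init__(self, n):
        self.data = list(range(n))

    def score(self, run):
        return run * len(self.data)

# Pickling the bound method drags the entire instance along,
# so the serialized payload scales with the object's data:
small = len(pickle.dumps(Model(10).score))
large = len(pickle.dumps(Model(100_000).score))
print(small, large)  # the second payload is far larger

# The usual workaround is to map a tiny top-level function that resolves
# the object already living on each engine by name, e.g.:
#     def run_one(run):
#         return regressions.regress_z_scores(run)  # engine-side name
#     ar = lview_joblib.map(run_one, runs)
```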
In each of these cases, how does the load-balanced view's map() actually execute the function call? And is there a best practice here?