Question

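I run the same simulation in a loop with different parameters. Each simulation uses a pandas DataFrame (data) that is only read, never modified. With ipyparallel (IPython parallel), I can put this DataFrame into the global variable space of each engine in my view before the simulations start: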

view['data'] = data

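The engines then have access to the DataFrame for all the simulations that run on them. Copying the data (data is 40MB when pickled) takes only a few seconds. However, memory usage appears to grow very large as the number of simulations grows. I imagine this shared data is being copied for each task rather than just once per engine. What is the best practice for sharing static read-only data from a client with the engines? Copying it once per engine is acceptable, but ideally it would only have to be copied once per host (I have 4 engines on host1 and 8 engines on host2).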

Here's my code:

from ipyparallel import Client
import pandas as pd

rc = Client()
view = rc[:]  # use all engines
view.scatter('id', rc.ids, flatten=True)  # So we can track which engine performed what task

def do_simulation(tweaks):
    """ Run simulation with specified tweaks """
    # Do sim stuff using the global data DataFrame
    return results, id, tweaks

if __name__ == '__main__':
    # `engine` is the database connection, defined elsewhere
    data = pd.read_sql("SELECT * FROM my_table", engine)
    threads = []  # store list of tweaks dicts
    for i in range(4):
        for j in range(5):
            for k in range(6):
                threads.append(dict(i=i, j=j, k=k))

    # Set up globals for each engine. This is the read-only DataFrame
    view['data'] = data
    ar = view.map_async(do_simulation, threads)

    # Our async results should pop up over time. Let's measure our progress:
    for idx, (results, id, tweaks) in enumerate(ar):
        print('Progress: {}%: Simulation {} finished on engine {}'.format(100.0 * ar.progress / len(ar), idx, id))
        # Store results as a pickle for the future
        pfile = '{}_{}_{}.pickle'.format(tweaks['i'], tweaks['j'], tweaks['k'])
        # Save our results to a pickle file (`out_file_path` is defined elsewhere)
        pd.to_pickle(results, out_file_path + pfile)

    print('Total execution time: {} (serial time: {})'.format(ar.wall_time, ar.serial_time))

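If the simulation count is small (~50), it takes a while to get started, but then I start to see the progress print statements. Strangely, multiple tasks get assigned to the same engine, and I don't see a response until all of the tasks assigned to that engine have completed. I would expect to see a response from enumerate(ar) every time a single simulation task completes.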

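If the simulation count is large (~1000), it takes a long time to get started. I see the CPUs throttle up on all engines, but no progress print statements appear until a long time (~40 minutes) has passed. When I do see progress, it appears that a large block (>100) of tasks went to the same engine and waited for that one engine to finish before any progress was reported. When that engine did finish, I saw the ar object provide new responses every 4 seconds; this may have been the time delay of writing the output pickle files.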

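Lastly, host1 also runs the ipycontroller task, and its memory usage goes up like crazy (a Python task shows >6GB of RAM in use, and a kernel task shows 3GB). The host2 engines don't really show much memory usage at all. What would cause this spike in memory?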

Answer 1

I used this logic in some code a couple of years ago. My code was something like:

shared_dict = {
    # big dict with ~10k keys, each with a list of dicts
}

from ipyparallel import Client

engines = Client()  # `engines` is the ipyparallel client
balancer = engines.load_balanced_view()

with engines[:].sync_imports(): # your 'view' variable 
    import pandas as pd
    import ujson as json

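# push() takes a {name: value} dict, so wrap the data to expose it
# as `shared_dict` in every engine's namespace
engines[:].push({'shared_dict': shared_dict})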

# `ids` is the list of inputs to process, defined elsewhere
results = balancer.map(lambda i: (i, my_func(i)), ids)
results_data = results.get()

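"If the simulation count is small (~50), it takes a while to get started, but then I start to see the progress print statements. Strangely, multiple tasks get assigned to the same engine, and I don't see a response until all of the tasks assigned to that engine have completed. I would expect to see a response from enumerate(ar) every time a single simulation task completes."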

In my case, my_func() was a complex method that wrote lots of logging messages to a file, so I had my print statements.

As for the task distribution: since I used load_balanced_view(), I left it to the library to find its own way, and it did a great job.
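To make the scheduling difference concrete, here is a minimal pure-Python sketch that needs no cluster. It assumes that map on a direct view such as rc[:] splits the task list into one contiguous block per engine, which is my understanding of the behaviour described in the question. The helper name contiguous_blocks is made up for illustration and is not part of ipyparallel.

```python
def contiguous_blocks(tasks, n_engines):
    """Split tasks into n_engines contiguous blocks of near-equal size.

    Illustrative approximation of how a direct view partitions work;
    not an ipyparallel function.
    """
    base, extra = divmod(len(tasks), n_engines)
    blocks, start = [], 0
    for engine in range(n_engines):
        # The first `extra` engines each take one additional task
        stop = start + base + (1 if engine < extra else 0)
        blocks.append(tasks[start:stop])
        start = stop
    return blocks


tasks = list(range(120))               # 4 * 5 * 6 parameter combinations
blocks = contiguous_blocks(tasks, 12)  # 4 engines on host1 + 8 on host2
print([len(b) for b in blocks])        # every engine gets 10 tasks up front
print(blocks[0])                       # engine 0 owns tasks 0 through 9
```

If results are then consumed in submission order, nothing can be reported until the engine holding the first block has worked through it, which would match the bursts you saw. A load-balanced view hands tasks out as engines become free instead; as far as I know its map accepts ordered=False and a chunksize argument, so balancer.map(func, tasks, ordered=False) should yield each result as soon as it completes.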

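"If the simulation count is large (~1000), it takes a long time to get started. I see the CPUs throttle up on all engines, but no progress print statements appear until a long time (~40 minutes) has passed. When I do see progress, it appears that a large block (>100) of tasks went to the same engine and waited for that one engine to finish before any progress was reported. When that engine did finish, I saw the ar object provide new responses every 4 seconds; this may have been the time delay of writing the output pickle files."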

I never ran into a delay that long, so I can't say anything about it.

I hope this sheds some light on your problem.

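PS: As I said in the comments, you could try multiprocessing.Pool. I don't think I have ever tried using it to share a large read-only object as a global variable. I would give it a try, because it seems to work.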

Answer 2

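Sometimes you need to scatter your data grouped by a category, so that you can be sure each subgroup is entirely contained in a single cluster.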

This is how I usually do it:

# Connect to the clusters
import ipyparallel as ipp
client = ipp.Client()
lview  = client.load_balanced_view()
lview.block = True
CORES = len(client[:])

# Define the scatter_by function
def scatter_by(df,grouper,name='df'):
    sz = df.groupby([grouper]).size().sort_values().index.unique()
    for core in range(CORES):
        ids = sz[core::CORES]
        print("Pushing {0} {1}s into cluster {2}...".format(len(ids),grouper,core))
        client[core].push({name:df[df[grouper].isin(ids)]})

# Scatter the dataframe df grouping by `year`
scatter_by(df,'year')

Note that the scatter function I'm suggesting makes sure each cluster hosts a similar number of observations, which is usually a good idea.
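The partitioning step can be checked without a cluster or pandas. The sketch below mirrors the logic of scatter_by using only the standard library: group keys are sorted by size (ascending) and dealt out round-robin. The function names assign_groups and rows_per_engine and the sample years are made up for illustration, and breaking ties by key is my own assumption.

```python
from collections import Counter


def assign_groups(labels, n_engines):
    """Sort group keys by size (ascending) and deal them out round-robin."""
    sizes = Counter(labels)
    ordered = sorted(sizes, key=lambda k: (sizes[k], k))
    return [ordered[i::n_engines] for i in range(n_engines)]


def rows_per_engine(labels, assignment):
    """Count how many rows each engine would receive."""
    sizes = Counter(labels)
    return [sum(sizes[k] for k in keys) for keys in assignment]


# Made-up sample: six years with 5, 1, 3, 2, 4 and 6 rows respectively
years = [2000] * 5 + [2001] * 1 + [2002] * 3 + [2003] * 2 + [2004] * 4 + [2005] * 6
assignment = assign_groups(years, 3)
print(assignment)                          # [[2001, 2004], [2003, 2000], [2002, 2005]]
print(rows_per_engine(years, assignment))  # [5, 7, 9]
```

As the counts show, the split is roughly even rather than exactly even, and every year stays whole on a single engine.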

How to best share static data between an ipyparallel client and remote engines?
