Losing workers over time

Date: 2017-09-13 10:10:18

Tags: graph dask

Here is a further look at losing workers over time. It is a follow-up to

Distributing graphs across cluster nodes

This example is not minimal, but it does give an idea of our typical pattern of working. The sleep is needed to trigger the problem; in the full application it arises because a large graph has to be generated from earlier results.

When I run it on the cluster, I use dask-ssh to get 32 workers on 8 nodes:

dask-ssh --nprocs 4 --nthreads 1 --scheduler-port 8786 --log-directory `pwd` --hostfile hostfile.$JOBID &
sleep 10

With the full set of workers it should run in less than ten minutes. I follow the execution on the diagnostics screens. Under Events I see the workers being added, but sometimes (not always) I also see a number of workers being removed, usually leaving only those on the node hosting the scheduler.

""" Test to illustrate losing workers under dask/distributed.

This mimics the overall structure and workload of our processing.

Tim Cornwell 9 Sept 2017
realtimcornwell@gmail.com
"""
import numpy
from dask import delayed
from distributed import Client


# Make some randomly located points on 2D plane
def init_sparse(n, margin=0.1):
    numpy.random.seed(8753193)
    return numpy.array([numpy.random.uniform(margin, 1.0 - margin, n),
                        numpy.random.uniform(margin, 1.0 - margin, n)]).reshape([n, 2])


# Put the points onto a grid and FFT, skip to save time
def grid_data(sparse_data, shape, skip=100):
    grid = numpy.zeros(shape, dtype='complex')
    loc = numpy.round(shape * sparse_data).astype('int')
    for i in range(0, sparse_data.shape[0], skip):
        grid[loc[i,:]] = 1.0
    return numpy.fft.fft(grid).real

# Accumulate all psfs into one psf
def accumulate(psf_list):
    lpsf = 0.0 * psf_list[0]
    for p in psf_list:
        lpsf += p
    return lpsf


if __name__ == '__main__':
    import sys
    import time
    start=time.time()

    # Process nchunks each of length len_chunk 2d points, making a psf of size shape
    len_chunk = int(1e6)
    nchunks = 16
    shape=[512, 512]
    skip = 100

    # We pass in the scheduler from the invoking script
    if len(sys.argv) > 1:
        scheduler = sys.argv[1]
        client = Client(scheduler)
    else:
        client = Client()

    print("On initialisation", client)

    sparse_graph = [delayed(init_sparse)(len_chunk) for i in range(nchunks)]
    sparse_graph = client.compute(sparse_graph, sync=True)
    print("After first sparse_graph", client)

    xfr_graph = [delayed(grid_data)(s, shape=shape, skip=skip) for s in sparse_graph]
    xfr = client.compute(xfr_graph, sync=True)
    print("After xfr", client)

    tsleep = 120.0
    print("Sleeping now for %.1f seconds" % tsleep)
    time.sleep(tsleep)
    print("After sleep", client)

    sparse_graph = [delayed(init_sparse)(len_chunk) for i in range(nchunks)]
    # sparse_graph = client.compute(sparse_graph, sync=True)
    xfr_graph = [delayed(grid_data)(s, shape=shape, skip=skip) for s in sparse_graph]
    psf_graph = delayed(accumulate)(xfr_graph)
    psf = client.compute(psf_graph, sync=True)

    print("*** Successfully reached end in %.1f seconds ***" % (time.time() - start))
    print(numpy.max(psf))
    print("After psf", client)

    client.shutdown()
    exit()

Grep'ing for the Client in a typical run shows:

On initialisation <Client: scheduler='tcp://sand-8-17:8786' processes=16 cores=16>
After first sparse_graph <Client: scheduler='tcp://sand-8-17:8786' processes=16 cores=16>
After xfr <Client: scheduler='tcp://sand-8-17:8786' processes=16 cores=16>
After sleep <Client: scheduler='tcp://sand-8-17:8786' processes=4 cores=4>
After psf <Client: scheduler='tcp://sand-8-17:8786' processes=4 cores=4>
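
To pin down exactly when the workers disappear, a small polling loop against Client.scheduler_info() can be left running alongside the job. A minimal sketch (the scheduler address is passed on the command line, and the 30-second interval is arbitrary):

# Poll the scheduler and report how many workers it currently knows about,
# so that any drop in the worker count shows up with a timestamp.
import sys
import time
from distributed import Client

if __name__ == '__main__':
    client = Client(sys.argv[1])   # e.g. tcp://sand-8-17:8786
    while True:
        workers = client.scheduler_info()['workers']
        print("%s  %d workers" % (time.strftime("%H:%M:%S"), len(workers)))
        time.sleep(30.0)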

Thanks, Tim

1 Answer:

Answer 0 (score: 0)

It's not clear why this happens, but it does. We were using dask-ssh, but needed more control over the creation of the workers. Eventually we settled on:

scheduler=$(head -1 hostfile.$JOBID)
hostIndex=0
for host in `cat hostfile.$JOBID`; do
    echo "Working on $host ...."
    if [ "$hostIndex" = "0" ]; then
        echo "run dask-scheduler"
        ssh $host dask-scheduler --port=8786 &
        sleep 5
    fi
    echo "run dask-worker"
    ssh $host dask-worker --host ${host} --nprocs NUMBER_PROCS_PER_NODE \
    --nthreads NUMBER_THREADS  \
    --memory-limit 0.25 --local-directory /tmp $scheduler:8786 &
    sleep 1
    hostIndex="1"
done
echo "Scheduler and workers now running"