Parallelization with joblib - performance saturation and general considerations

Time: 2017-01-05 08:37:52

Tags: python optimization parallel-processing multiprocessing joblib

I am using joblib to gain some efficiency on the simple task of building probability densities for discrete data. In short, what puzzles me is that my performance gains saturate at exactly 2 parallel processes, and nothing more is gained by adding further ones. I am also curious about other possible approaches to optimizing this program. Let me first run through the specifics of the problem.

Consider a binary array X of shape (n_samples, n_features) and a vector y of categorical labels. For the purposes of the experiment, this will do:

import numpy as np

n_samples, n_features = 20000, 20  # the values used in the first experiment below
X = np.random.randint(0, 2, size=[n_samples, n_features])
y = np.random.randint(0, 10, size=[n_samples,])

The function joint_probability_binary takes as input a column of the feature array X (a single feature) and the label vector y, and outputs their joint distribution. Nothing fancy.

def joint_probability_binary(x, y):
    """Joint distribution of one binary feature x and the labels y."""
    labels = list(set(y))
    joint = np.zeros([len(labels), 2])

    # count co-occurrences of (label, bit value); xrange in the original Python 2 code
    for i in range(y.shape[0]):
        joint[y[i], x[i]] += 1

    return joint / float(y.shape[0])
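
As a quick sanity check (a hypothetical snippet, not part of the original post): the result has one row per label and one column per bit value, and its entries sum to 1, since every sample contributes exactly one count.

joint = joint_probability_binary(X[:, 0], y)
print(joint.shape)  # (10, 2): one row per label, one column per bit value
print(joint.sum())  # 1.0 (up to float rounding): each sample contributes 1/n_samples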

Now, I would like to apply joint_probability_binary to every feature (every column) of X. My understanding is that this task (given a large enough value of n_samples) is coarse-grained enough for multiprocessing parallelism to pay off. I wrote a sequential and a parallel function to perform it.

from joblib import Parallel, delayed

def joints_sequential(X, y):
    return [joint_probability_binary(X[:,i],y) for i in range(X.shape[1])]

def joints_parallel(X, y, n_jobs):
    return Parallel(n_jobs=n_jobs, verbose=0)(
        delayed(joint_probability_binary)(x=X[:, i], y=y)  # keyword must be x, not X
        for i in range(X.shape[1]))

I adapted the timing function written by Guido van Rossum himself, as presented here, as follows:

import time

def timing(f, n, **kwargs):
    r = range(n)
    # NB: time.clock() measures CPU time of the calling process on most platforms
    # (deprecated since Python 3.3, removed in 3.8)
    t1 = time.clock()
    for i in r:
        # the call is repeated ten times per loop iteration to amortize
        # loop overhead, following Guido's original timing recipe
        f(**kwargs)
        f(**kwargs)
        f(**kwargs)
        f(**kwargs)
        f(**kwargs)
        f(**kwargs)
        f(**kwargs)
        f(**kwargs)
        f(**kwargs)
        f(**kwargs)
    t2 = time.clock()
    return round(t2 - t1, 3)

Finally, to study how performance varies with the number of jobs, I run:

tseq = timing(joints_sequential,10, X=X,y=y)
print('Sequential list comprehension - Finished in %s sec' %tseq)

for nj in range(1,9):
    tpar = timing(joints_parallel,10, X=X, y=y, n_jobs=nj)
    print('Parallel execution - %s jobs - Finished in %s sec' %(nj,tpar))

For n_samples = 20000 and n_features = 20, I get:

Sequential list comprehension - Finished in 60.778 sec
Parallel execution - 1 jobs - Finished in 61.975 sec
Parallel execution - 2 jobs - Finished in 6.446 sec
Parallel execution - 3 jobs - Finished in 7.516 sec
Parallel execution - 4 jobs - Finished in 8.275 sec
Parallel execution - 5 jobs - Finished in 8.953 sec
Parallel execution - 6 jobs - Finished in 9.962 sec
Parallel execution - 7 jobs - Finished in 10.382 sec
Parallel execution - 8 jobs - Finished in 11.321 sec

This result confirms that there is quite a lot to be gained from performing this task in parallel (I ran this on OS X, on a 2 GHz Intel Core i7 with 4 cores). What I find most striking, however, is that performance already saturates at n_jobs = 2. Given the size of each task, I find it hard to imagine that this could be caused by joblib overhead alone, but then again my intuition is limited. I repeated the experiment with larger arrays, n_samples = 200000 and n_features = 40, which led to the same behavior:

Sequential list comprehension - Finished in 1230.172 sec
Parallel execution - 1 jobs - Finished in 1198.981 sec
Parallel execution - 2 jobs - Finished in 94.624 sec
Parallel execution - 3 jobs - Finished in 95.1 sec
...

1. Does anyone have an intuition about why this is the case (assuming that my overall approach is reasonable enough)?

2. Finally, in terms of overall optimization, what other ways are there to improve the performance of a program of this kind? I suspect that a lot could be gained by writing a Cython implementation of the function that computes the joint probability, but I have no experience with it.

2 Answers:

Answer 0 (score: 0):

In my experience, this is usually because you are oversubscribing your cores. On my desktop with an i7-3770, I get the following:

Sequential list comprehension - Finished in 25.734 sec
Parallel execution - 1 jobs - Finished in 25.532 sec
Parallel execution - 2 jobs - Finished in 4.302 sec
Parallel execution - 3 jobs - Finished in 4.178 sec
Parallel execution - 4 jobs - Finished in 4.521 sec

Without knowing more about your system, I can't help much, but laptop processors commonly have more logical cores than physical ones, due to hyperthreading or similar technologies. This is not the kind of task that benefits from hyperthreading, though: you won't see any gain from the extra threads, since nothing here blocks on IO, so there is little for them to do.
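
A minimal sketch for checking the logical vs. physical core count (psutil is a third-party package and not part of the original answer):

import os
import psutil  # third-party: pip install psutil

print("logical cores: ", os.cpu_count())                   # includes hyperthreads
print("physical cores:", psutil.cpu_count(logical=False))  # physical cores only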

You may also have a CPU that automatically raises its clock rate when only one or two cores are heavily used, but lowers it when all cores are under heavy load. This can give two cores some extra performance.

For more performance, I would suggest writing the joint_probability_binary() function as a numpy ufunc, using numpy's frompyfunc() to wrap it: https://docs.scipy.org/doc/numpy/reference/ufuncs.html
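
For illustration, a minimal sketch of the frompyfunc API on a toy scalar function (note that joint_probability_binary reduces over a whole column, so it does not map one-to-one onto an elementwise ufunc, and the wrapped function still executes as Python for every element):

import numpy as np

def xor_bit(a, b):
    # toy scalar function with two inputs and one output
    return a ^ b

uxor = np.frompyfunc(xor_bit, 2, 1)  # nin=2, nout=1
out = uxor(np.array([0, 1, 1]), np.array([1, 1, 0]))
print(out)                   # object-dtype array: [1 0 1]
print(out.astype(np.int64))  # cast back to an integer dtype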

Numba could also help, but I have never used it: http://numba.pydata.org/numba-doc/0.35.0/index.html

Answer 1 (score: 0):

This SO page was the first hit for my Google query "joblib performance", so I did some investigation.

Regarding the first question ("Does anyone have an intuition about why this is the case, assuming the overall approach is reasonable enough?"):

I think the problem is being memory bound, and an unclear measurement confuses the picture. I ran the original code and measured the runtime externally via time python3 joblib_test.py, where in joblib_test.py I commented out all but one of the evaluations. On my 4-core CPU I used n_samples = 2000000 and n_features = 40, with a reduced number of repetitions:

1. Sequential list comprehension - Finished in 54.911 sec
   real 0m55.307s

2. Parallel execution - 4 jobs - Finished in 2.515 sec
   real 0m53.519s

This shows clearly that the actual wall-clock runtime is almost identical in both cases: time.clock() only accounts for CPU time spent in the parent process, so the work done in joblib's worker processes never shows up in the reported numbers.
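
A wall-clock variant of the timing helper (a sketch, not from the original answer; time.perf_counter() measures elapsed real time, so time spent in the worker processes is included):

import time

def timing_wall(f, n, **kwargs):
    # elapsed wall time, including work done in joblib's worker processes
    t1 = time.perf_counter()
    for _ in range(n):
        f(**kwargs)
    t2 = time.perf_counter()
    return round(t2 - t1, 3)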

  

Regarding the second question ("What other ways are there to improve the performance of this kind of program?"):

Using numba (import numba, then decorating the actual worker with @numba.jit(nopython=True,cache=True)) and making minor modifications to that worker leads to a 7x speedup!

1. Sequential list comprehension (mod) - Finished in 7.665 sec
   real 0m7.167s

2. Parallel execution (mod) - 4 jobs - Finished in 2.004 sec
   real 0m9.143s

Once again, this nicely demonstrates being limited by memory bandwidth. For the optimized version, there is some overhead from using 4 cores.

Full code example:

n_samples = 2000000
n_features = 40

print("n_samples = ", n_samples, "  n_features = ", n_features)

import numpy as np
# X = np.random.randint(0,2,size=[n_samples,n_features])
# y = np.random.randint(0,10,size=[n_samples,])

def joint_probability_binary(x, y):

    labels    = list(set(y))
    joint = np.zeros([len(labels), 2])

    for i in range(y.shape[0]):
        joint[y[i], x[i]] += 1

    return joint / float(y.shape[0])

import numba
@numba.jit(nopython=True,cache=True)
def joint_probability_binary_mod(x, y):
    labels    = np.unique(y)
    joint = np.zeros((labels.size, 2))

    for i in range(y.shape[0]):
        joint[y[i], x[i]] += 1

    return joint / float(y.shape[0])

from joblib import Parallel, delayed

def joints_sequential(the_job):
    X = np.random.randint(0,2,size=[n_samples,n_features])
    y = np.random.randint(0,10,size=[n_samples,])
    return [the_job(X[:,i],y) for i in range(X.shape[1])]


def joints_parallel(n_jobs, the_job,batch_size='auto'):
    X = np.random.randint(0,2,size=[n_samples,n_features])
    y = np.random.randint(0,10,size=[n_samples,])
    return Parallel(n_jobs=n_jobs, verbose=0,batch_size=batch_size)(
        delayed(the_job)(x = X[:,i],y = y) 
        for i in range(X.shape[1])
    )

import time

def timing(f, n, **kwargs):
    r = range(n)
    t1 = time.clock()  # CPU time of this process only, not wall time
    for i in r:
        res = f(**kwargs)
    t2 = time.clock()
    # print(np.sum(res))
    return round(t2 - t1, 3)

ttime = 0

# tseq = timing(joints_sequential,1, the_job=joint_probability_binary_mod)
# print('Sequential list comprehension (mod) - Finished in %s sec' %tseq)
# ttime+=tseq

for nj in range(4,5):
    tpar = timing(joints_parallel,1,n_jobs=nj,
                  the_job=joint_probability_binary_mod,
                  batch_size = int(n_samples/nj))
    print('Parallel execution (mod) - %s jobs - Finished in %s sec' %(nj,tpar))
    ttime+=tpar

# tseq = timing(joints_sequential,1, the_job=joint_probability_binary)
# print('Sequential list comprehension - Finished in %s sec' %tseq)
# ttime+=tseq

# for nj in range(4,5):
#     tpar = timing(joints_parallel,1,n_jobs=nj, the_job=joint_probability_binary)
#     print('Parallel execution - %s jobs - Finished in %s sec' %(nj,tpar))
#     ttime+=tpar

print("total time measured by Python",ttime)