Why is numpy / scipy faster without OpenBLAS?

Time: 2015-04-13 23:02:56

Tags: python performance numpy scipy openblas

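I have two installs:

  1. brew install numpy (and scipy) --with-openblas
  2. Cloned the Git repositories (for numpy and scipy) and built them myself

I then cloned two handy scripts for verifying these libraries in a multithreaded environment: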

    git clone https://gist.github.com/3842524.git
    

Then, for each install, I ran show_config:

    python -c "import scipy as np; np.show_config()"
    

For install 1, everything looks fine:

    lapack_opt_info:
        libraries = ['openblas', 'openblas']
        library_dirs = ['/usr/local/opt/openblas/lib']
        language = f77
    blas_opt_info:
        libraries = ['openblas', 'openblas']
        library_dirs = ['/usr/local/opt/openblas/lib']
        language = f77
    openblas_info:
        libraries = ['openblas', 'openblas']
        library_dirs = ['/usr/local/opt/openblas/lib']
        language = f77
    blas_mkl_info:
        NOT AVAILABLE
    

For install 2, though, things are not so bright:

    lapack_opt_info:
        extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
        extra_compile_args = ['-msse3']
        define_macros = [('NO_ATLAS_INFO', 3)]
    blas_opt_info:
        extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
        extra_compile_args = ['-msse3', '-I/System/Library/Frameworks/vecLib.framework/Headers']
        define_macros = [('NO_ATLAS_INFO', 3)]
    

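So it seems I failed to link OpenBLAS correctly. That is fine for now; here are the performance results. All tests were run on an iMac, Yosemite, i7-4790K, 4 cores, hyper-threaded.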

First install, with OpenBLAS:

numpy:

    OMP_NUM_THREADS=1 python test_numpy.py
    FAST BLAS
    version: 1.9.2
    maxint: 9223372036854775807
    dot: 0.126578998566 sec
    
    OMP_NUM_THREADS=2 python test_numpy.py
    FAST BLAS
    version: 1.9.2
    maxint: 9223372036854775807
    dot: 0.0640147686005 sec
    
    OMP_NUM_THREADS=4 python test_numpy.py
    FAST BLAS
    version: 1.9.2
    maxint: 9223372036854775807
    dot: 0.0360922336578 sec
    
    OMP_NUM_THREADS=8 python test_numpy.py
    FAST BLAS
    version: 1.9.2
    maxint: 9223372036854775807
    dot: 0.0364527702332 sec
    

scipy:

    OMP_NUM_THREADS=1 python test_scipy.py
    cholesky: 0.0276656150818 sec
    svd: 0.732437372208 sec
    
    OMP_NUM_THREADS=2 python test_scipy.py
    cholesky: 0.0182101726532 sec
    svd: 0.441690778732 sec
    
    OMP_NUM_THREADS=4 python test_scipy.py
    cholesky: 0.0130400180817 sec
    svd: 0.316107988358 sec
    
    OMP_NUM_THREADS=8 python test_scipy.py
    cholesky: 0.012854385376 sec
    svd: 0.315939807892 sec
    

Second install, without OpenBLAS:

numpy:

    OMP_NUM_THREADS=1 python test_numpy.py
    slow blas
    version: 1.10.0.dev0+3c5409e
    maxint: 9223372036854775807
    dot: 0.0371072292328 sec
    
    OMP_NUM_THREADS=2 python test_numpy.py
    slow blas
    version: 1.10.0.dev0+3c5409e
    maxint: 9223372036854775807
    dot: 0.0215149879456 sec
    
    OMP_NUM_THREADS=4 python test_numpy.py
    slow blas
    version: 1.10.0.dev0+3c5409e
    maxint: 9223372036854775807
    dot: 0.0146862030029 sec
    
    OMP_NUM_THREADS=8 python test_numpy.py
    slow blas
    version: 1.10.0.dev0+3c5409e
    maxint: 9223372036854775807
    dot: 0.0141334056854 sec
    

scipy:

    OMP_NUM_THREADS=1 python test_scipy.py
    cholesky: 0.0109382152557 sec
    svd: 0.32529540062 sec
    
    OMP_NUM_THREADS=2 python test_scipy.py
    cholesky: 0.00988121032715 sec
    svd: 0.331357002258 sec
    
    OMP_NUM_THREADS=4 python test_scipy.py
    cholesky: 0.00916676521301 sec
    svd: 0.318637990952 sec
    
    OMP_NUM_THREADS=8 python test_scipy.py
    cholesky: 0.00931282043457 sec
    svd: 0.324427986145 sec
    

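To my surprise, the second case is faster than the first. For scipy, adding more cores brings no performance gain, yet even a single core is faster than 4 cores with OpenBLAS.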
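To make the comparison concrete, the timings above can be reduced to speedup ratios. The following is only a minimal sketch: the numbers are copied from the runs listed above, while the dictionary layout and the function names `scaling` and `ratio` are illustrative assumptions, not part of the test scripts.

```python
# Timings in seconds, copied from the runs above and keyed by OMP_NUM_THREADS.
# The structure and names below are illustrative, not taken from the gist.
TIMINGS = {
    "openblas": {
        "dot": {1: 0.126578998566, 2: 0.0640147686005,
                4: 0.0360922336578, 8: 0.0364527702332},
        "cholesky": {1: 0.0276656150818, 2: 0.0182101726532,
                     4: 0.0130400180817, 8: 0.012854385376},
        "svd": {1: 0.732437372208, 2: 0.441690778732,
                4: 0.316107988358, 8: 0.315939807892},
    },
    "source_build": {
        "dot": {1: 0.0371072292328, 2: 0.0215149879456,
                4: 0.0146862030029, 8: 0.0141334056854},
        "cholesky": {1: 0.0109382152557, 2: 0.00988121032715,
                     4: 0.00916676521301, 8: 0.00931282043457},
        "svd": {1: 0.32529540062, 2: 0.331357002258,
                4: 0.318637990952, 8: 0.324427986145},
    },
}


def scaling(times):
    """Speedup of each thread count relative to the single-threaded run."""
    base = times[1]
    return {n: base / t for n, t in sorted(times.items())}


def ratio(op, openblas_threads, source_threads):
    """OpenBLAS time divided by source-build time; above 1 means the source build wins."""
    return (TIMINGS["openblas"][op][openblas_threads]
            / TIMINGS["source_build"][op][source_threads])


if __name__ == "__main__":
    for install, ops in TIMINGS.items():
        for op, times in ops.items():
            row = ", ".join(f"{n} threads: {s:.2f}x" for n, s in scaling(times).items())
            print(f"{install:12s} {op:8s} {row}")
    # Single-threaded source build versus 4-thread OpenBLAS.
    for op in ("cholesky", "svd"):
        print(op, round(ratio(op, 4, 1), 3))
```

With these numbers, `ratio("cholesky", 4, 1)` comes out above 1, so the single-threaded source build does beat 4-thread OpenBLAS there, while `ratio("svd", 4, 1)` is slightly below 1, so for SVD the two are roughly on par.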

Does anyone know why this is?

1 Answer:

Answer 0: (score: 8)

There are two obvious differences that could account for the discrepancy:

  1. You are comparing two different versions of numpy. The OpenBLAS-linked version you installed with Homebrew is 1.9.2, whereas the one you built from source is 1.10.0.dev0+3c5409e.

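  2. Although the newer version is not linked against OpenBLAS, it is linked against Apple's Accelerate Framework, which is a different optimized BLAS implementation.

The reason your test script still reports slow blas for the second case is an incompatibility with the latest versions of numpy. The script you are using tests whether numpy is linked against an optimized BLAS library by checking for the presence of numpy.core._dotblas: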

    try:
        import numpy.core._dotblas
        print('FAST BLAS')
    except ImportError:
        print('slow blas')
    

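In older versions of numpy, this C module was compiled during installation only if an optimized BLAS library was found. However, _dotblas has been removed altogether in development versions > 1.10.0 (as mentioned in this previous SO question), so the script will always report slow blas for these versions.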

I've written an updated version of the numpy test script that correctly reports the BLAS linkage for the latest versions; you can find it here
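The updated script is not reproduced here. As a rough illustration of a check that does not depend on _dotblas, the sketch below infers the linked BLAS from the text printed by show_config(). It is not necessarily how the linked script works. The names `KNOWN_BACKENDS`, `LINK_KEYS`, `detect_blas` and `installed_numpy_backends` are hypothetical, the backend list is an assumption and not exhaustive, and the parser assumes the older `key = [values]` layout shown in the question; newer numpy releases print a different format, for which it would likely find nothing.

```python
import contextlib
import io

# Substrings that identify some common optimized BLAS/LAPACK backends.
# This list is an assumption for illustration and is not exhaustive.
KNOWN_BACKENDS = ("openblas", "mkl", "accelerate", "atlas", "blis")

# Only these keys describe what is actually linked. Section names such as
# blas_mkl_info and macros such as NO_ATLAS_INFO are deliberately ignored.
LINK_KEYS = ("libraries", "extra_link_args")


def detect_blas(config_text):
    """Return the sorted known backends named on link-related lines of config_text."""
    found = set()
    for line in config_text.splitlines():
        key, sep, value = line.partition("=")
        if not sep or key.strip() not in LINK_KEYS:
            continue
        lowered = value.lower()
        found.update(name for name in KNOWN_BACKENDS if name in lowered)
    return sorted(found)


def installed_numpy_backends():
    """Capture numpy.show_config() output and run detect_blas on it."""
    import numpy as np  # imported lazily so detect_blas works without numpy

    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        np.show_config()
    return detect_blas(buffer.getvalue())
```

Fed the two configuration dumps from the question, `detect_blas` would report openblas for the Homebrew install and accelerate for the source build, which matches the second point above.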