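I need to perform a matrix-vector multiplication where the matrix is complex-valued, symmetric, and has four nonzero off-diagonal bands. So far I have used the sparse BLAS routine mkl_zdiasymv for the multiplication, and it works fine on one core. I would like to get a performance gain from multithreading (e.g. OpenMP). As far as I know, some (many?) MKL routines are threaded. However, if I call mkl_set_num_threads(4), my program still runs on a single thread.
To give a concrete example, here is a small test program that I compile (using icc 14.01) with: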
icc mkl_test_mp.cpp -mkl -std=c++0x -openmp
mkl_test_mp.cpp:
#include <complex>
#include <vector>
#include <iostream>
#include <chrono>
#include <cmath>  // std::sqrt

typedef std::complex<double> complex;
using std::vector;
using namespace std::chrono;

// Make MKL use std::complex<double> as its double-precision complex type.
#define MKL_Complex16 std::complex<double>
#include "mkl.h"

int vector_dimension = 10000000;
int number_of_multiplications = 100;

// Builds the diagonal-format storage: one column of length vector_dimension per stored diagonal.
vector<complex> initialize_matrix() {
    complex value_main_diagonal = complex(1, 2);
    complex value_sub_and_super_diagonal = complex(3, 4);
    complex value_far_off_diagonal = complex(5, 6);
    std::vector<complex> matrix;
    matrix.resize(1 * vector_dimension, value_main_diagonal);
    matrix.resize(2 * vector_dimension, value_sub_and_super_diagonal);
    matrix.resize(3 * vector_dimension, value_far_off_diagonal);
    return matrix;
}

vector<complex> perform_matrix_vector_calculation(vector<complex>& matrix, const vector<complex>& x) {
    mkl_set_num_threads(4);
    vector<complex> result(vector_dimension);
    char uplo = 'L'; // since the matrix is symmetric we only need to declare one triangular part of the matrix (here the lower one)
    int number_of_nonzero_diagonals = 3;
    vector<int> matrix_diagonal_offsets = {0, -1, -static_cast<int>(std::sqrt(vector_dimension))};
    complex *x_data = const_cast<complex* >(x.data()); // I do not like this, but mkl expects non const pointer (??)
    mkl_zdiasymv (
        &uplo,
        &vector_dimension,
        matrix.data(),
        &vector_dimension,
        matrix_diagonal_offsets.data(),
        &number_of_nonzero_diagonals,
        x_data,
        result.data()
    );
    return result;
}

void print(vector<complex>& x) {
    for(complex z : x)
        std::cerr << z;
    std::cerr << std::endl;
}

void run() {
    vector<complex> matrix = initialize_matrix();
    vector<complex> current_vector(vector_dimension, 1);
    for(int i = 0; i < number_of_multiplications; ++i) {
        current_vector = perform_matrix_vector_calculation(matrix, current_vector);
    }
    std::cerr << current_vector[0] << std::endl;
}

int main() {
    auto start = steady_clock::now();
    run();
    auto end = steady_clock::now();
    std::cerr << "runtime = " << duration<double, std::milli> (end - start).count() << " ms" << std::endl;
    std::cerr << "runtime per multiplication = " << duration<double, std::milli> (end - start).count()/number_of_multiplications << " ms" << std::endl;
}
Is it even possible to parallelize this way? What am I doing wrong? Are there any other suggestions for speeding up the multiplication?
Answer 0 (score: 2)
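Since you do not show how you compile the code, could you check that you are linking against the multi-threaded Intel MKL libraries and, e.g., pthreads?
For example (this is for an older version of MKL):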
THREADING_LIB="$(MKL_PATH)/libmkl_$(IFACE_THREADING_PART)_thread.$(EXT)"
OMP_LIB = -L"$(CMPLR_PATH)" -liomp5
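There should be an examples directory in the MKL distribution, e.g. intel/composer_xe_2011_sp1.10.319/mkl/examples. There you can check the contents of spblasc/makefile to see how to link correctly against the multi-threaded libraries for your particular version of MKL.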
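Before digging into link lines, it can help to confirm that the binary was built with OpenMP at all and how many threads the OpenMP runtime would hand out. The snippet below is only a minimal sketch: openmp_max_threads and report_threading are hypothetical helper names, not part of MKL, and the check says nothing about whether a particular MKL routine is itself threaded.

```cpp
#include <cassert>
#include <iostream>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper (not an MKL function): number of threads the OpenMP
// runtime would use, or 1 if this file was compiled without OpenMP support.
int openmp_max_threads() {
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}

// Prints the value to stderr; intended to be called once at program start.
void report_threading() {
    int n = openmp_max_threads();
    assert(n >= 1);  // a runtime always offers at least one thread
    std::cerr << "OpenMP max threads = " << n << std::endl;
}
```

If this prints 1 even though -openmp was passed, the problem is in the build or the environment (e.g. OMP_NUM_THREADS) rather than in the MKL call. MKL's own mkl_get_max_threads() can be printed alongside it to see the limit MKL thinks it has.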
Another suggestion that should speed things up is to add compiler optimization flags, e.g.
OPT_FLAGS = -xHost -O3
These allow icc to generate code optimized for your architecture, so your compile line would end up as:
icc mkl_test_mp.cpp -mkl -std=c++0x -openmp -xHost -O3
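As for other ways to speed up the multiplication: if the MKL routine turns out to stay single-threaded for this storage format, one fallback is a hand-written kernel parallelized with OpenMP. The following is a sketch, not a drop-in replacement for mkl_zdiasymv. The function name sym_dia_matvec is made up for illustration, and the storage layout (column k holds the lower diagonal with offset offsets[k] <= 0, with A(i, i + offsets[k]) at val[k * n + i]) is an assumption modeled on the diagonal format used in the question. Each row is computed in gather form, so every thread writes only its own y[i] and no synchronization is needed.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// y = A * x for a symmetric matrix given by its main and lower diagonals.
// Assumed layout: val has offsets.size() columns of length n = x.size();
// column k stores the diagonal offsets[k] (<= 0), A(i, i + d) at val[k*n + i].
std::vector<cplx> sym_dia_matvec(const std::vector<cplx>& val,
                                 const std::vector<int>& offsets,
                                 const std::vector<cplx>& x) {
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(x.size());
    const std::ptrdiff_t ndiag = static_cast<std::ptrdiff_t>(offsets.size());
    assert(static_cast<std::ptrdiff_t>(val.size()) >= ndiag * n);
    std::vector<cplx> y(x.size());

    // The pragma is ignored when compiling without OpenMP; the loop then runs serially.
    #pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        cplx sum = 0;
        for (std::ptrdiff_t k = 0; k < ndiag; ++k) {
            const std::ptrdiff_t d = offsets[k];
            const cplx* diag = val.data() + k * n;
            if (d == 0) {
                sum += diag[i] * x[i];                          // main diagonal
                continue;
            }
            if (i + d >= 0) sum += diag[i] * x[i + d];          // lower part: A(i, i+d)
            if (i - d < n)  sum += diag[i - d] * x[i - d];      // mirrored upper part: A(i, i-d) = A(i-d, i)
        }
        y[i] = sum;
    }
    return y;
}
```

With the question's data this would be called as sym_dia_matvec(matrix, matrix_diagonal_offsets, current_vector). A matrix-vector product like this is usually limited by memory bandwidth, so the gain from four threads may well be less than a factor of four; it is worth timing against the MKL version before switching.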