For multiplying large binary matrices (10Kx20K), what I usually do is convert the matrices to float ones and perform float matrix multiplication, as integer matrix multiplication is quite slow (have a look at here).
This time though, I would need to perform over a hundred thousand of such multiplications, and even a millisecond performance improvement on average matters to me.
I want an int or float result matrix, because the product may contain elements that aren't 0 or 1. The input matrix elements are all 0 or 1, so they can be stored as single bits.
In the inner product between a row vector and a column vector (to produce one element of the output matrix), multiplication simplifies to bitwise AND. Addition is still addition, but we can add the bits with a population-count function instead of looping over them individually.
Some other boolean/binary matrix functions OR the bits instead of counting them, producing a bit-matrix result, but that's not what I need.
The sample code below shows that formulating the problem as std::bitset, AND and count operations is faster than matrix multiplication.
#include <iostream>
using std::cout; using std::endl;
#include <vector>
using std::vector;
#include <chrono>
#include <Eigen/Dense>
using Eigen::Map; using Eigen::Matrix; using Eigen::MatrixXf;
#include <random>
using std::random_device; using std::mt19937; using std::uniform_int_distribution;
#include <bitset>
using std::bitset;
#include <cmath>
using std::floor;
const int NROW = 1000;
const int NCOL = 20000;
const float DENSITY = 0.4;
const float DENOMINATOR = 10.0 - (10*DENSITY);
void fill_random(vector<float>& vec) {
random_device rd;
mt19937 eng(rd());
uniform_int_distribution<> distr(0, 10);
for (int i = 0; i < NROW*NCOL; ++i)
vec.push_back(floor(distr(eng)/DENOMINATOR));
}
void matmul(vector<float>& vec){
float *p = vec.data();
MatrixXf A = Eigen::Map<Eigen::Matrix<float, NROW, NCOL, Eigen::RowMajor>>(p);
cout << "Eigen matrix has " << A.rows() << " rows and " << A.cols() << " columns." << endl;
cout << "Total non-zero values : " << A.sum() << endl;
cout << "The density of non-zero values is " << A.sum() * 1.0 / (A.cols()*A.rows()) << endl;
auto start = std::chrono::steady_clock::now();
MatrixXf B = A.transpose() * A;
auto end = (std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - start)).count();
cout << "Mat mul took " << end << " ms"<< endl;
// Just to make sure the operation is not skipped by compiler
cout << "Eigen coo ";
for (int i=0; i<10; ++i)
cout << B(0,i) << " ";
cout << endl;
}
void bitset_op(vector<float>& vec) {
// yeah it's not a great idea to set size at compile time but have to
vector<bitset<NROW>> col_major(NCOL);
// right, multiple par for isn't a good idea, maybe in a parallel block
// Doing this for simplicity to profile second loop timing
// converting row major float vec to col major bool vec
#pragma omp parallel for
for (int j=0; j < NCOL; ++j) {
for (int i=0; i < NROW; ++i) {
col_major[j].set(i, vec[i*NCOL + j] && 1);
}
}
auto start = std::chrono::steady_clock::now();
vector<int> coo;
coo.assign(NCOL*NCOL, 0);
#pragma omp parallel for
for (int j=0; j < NCOL; ++j) {
for (int k=0; k<NCOL; ++k) {
coo[j*NCOL + k] = (col_major[j]&col_major[k]).count();
}
}
auto end = (std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - start)).count();
cout << "bitset intersection took " << end << " ms"<< endl;
// Just to make sure the operation is not skipped by compiler
cout << "biset coo ";
for (int i=0; i<10; ++i)
cout << coo[i] << " ";
cout << endl;
}
int main() {
// Saving to float instead of int to speed up matmul
vector<float> vec;
fill_random(vec);
matmul(vec);
bitset_op(vec);
}
Compiled with:
g++ -O3 -fopenmp -march=native -I. -std=c++11 code.cpp -o code
I get:
Eigen matrix has 1000 rows and 20000 columns.
Total non-zero values : 9.08978e+06
The density of non-zero values is 0.454489
Mat mul took 1849 ms
Eigen coo 458 206 208 201 224 205 204 199 217 210
bitset intersection took 602 ms
biset coo 458 206 208 201 224 205 204 199 217 210
As you can see, the matmul as a batch of bitset AND and count operations is about 3x faster than Eigen's float matmul, which makes sense.
I'd like to stress that I need to perform this operation over 100K times (in an HPC or cloud setting), so a millisecond performance improvement on average would make a difference.
I'm not bound to any specific library, C++ standard, etc. So feel free to answer with any solution you think would be faster, other than using a GPU, which I can't use for a number of reasons.