我正在使用矩阵乘法的线程实现来处理我的自定义Matrix类,并且我遇到了加速的一些问题。
设置和调用操作如下所示:
Matrix<double> left(2,2);
left[0][0] = 1; left[0][1] = 2;
left[1][0] = 3; left[1][1] = 4;
Matrix<double> right(2,2);
right[0][0] = 1; right[0][1] = 0;
right[1][0] = 0; right[1][1] = 1;
Matrix<double> result = left * right;
这是我的类的简化版本,其中包含了查看操作如何工作所必需的实现。我包含了我的构造函数,复制构造函数,赋值运算符和析构函数以获得彻底性,但最后两个函数主要是我关注的内容。此外,我正在调换右侧被乘数以利用空间局部性,因此输出矩阵中的每个位置都是左被乘数的一行和右被乘数的一行的点积,与左被乘数的一行相比较右被乘数的一列。这是代码:
#include <iostream>
#include <thread>
using namespace std;
template <class type>
class Matrix {
public:
// Default Constructor
Matrix() { this->create(); }
// Custom Constructor
Matrix(int r, int c) { this->create(r,c); }
// Copy Constructor
Matrix(const Matrix<type> &m) { this->copy(m); }
// Assignment Operator
Matrix<type> operator=(const Matrix<type> &m);
// Destructor
~Matrix() { this->destroy(); }
// Accessor Functions
int getRows() const { return rows; }
int getCols() const { return cols; }
type* operator[](int i) const { return contents[i]; }
// Matrix Operations
Matrix<type> transpose() const;
// Matrix Multiplication
friend Matrix<type> operator*(const Matrix<type> &m1, const Matrix<type> &m2)
{ Matrix<type> call; return call.multiply(m1,m2); }
private:
// Private Member Functions
void create();
void create(int r, int c);
void copy(const Matrix<type> &m);
void destroy();
// Operator Overloading Functions
Matrix<type> multiply(const Matrix<type> &m1, const Matrix<type> &m2);
// Private Member Variables
int rows;
int cols;
type** contents;
};
// Default Constructor
template <class type>
void Matrix<type>::create() {
rows = 0;
cols = 0;
contents = NULL;
}
// Custom Constructor
template <class type>
void Matrix<type>::create(int r, int c) {
// Set Integer Values
rows = r;
cols = c;
// Allocate Two-Dimensional Data
contents = new type*[r];
for (int i = 0; i < r; i++) {
contents[i] = new type[c];
}
}
// Copy Constructor
template <class type>
void Matrix<type>::copy(const Matrix &m) {
// Create New Matrix From Existing One
this->rows = m.getRows();
this->cols = m.getCols();
// Allocate Two-Dimensional Data
this->contents = new type*[m.getRows()];
for (int i = 0; i < m.getRows(); i++) {
this->contents[i] = new type[m.getCols()];
}
// Copy Over Data
for (int i = 0; i < m.getRows(); i++) {
for (int j = 0; j < m.getCols(); j++) {
(*this)[i][j] = m[i][j];
}
}
}
// Assignment Operator
template <class type>
Matrix<type> Matrix<type>::operator=(const Matrix<type> &m) {
// Overwrite Existing Matrix with Another
// (Allowing for Self-Assignment)
if (this != &m) {
this->destroy();
this->copy(m);
}
return *this;
}
// Destructor
template <class type>
void Matrix<type>::destroy() {
// Frees Allocated Memory
for (int i = 0; i < rows; i++) {
delete[] contents[i];
}
delete[] contents;
}
// Matrix Transpose
template <class type>
Matrix<type> Matrix<type>::transpose() const {
Matrix<type> tran(cols,rows);
for (int i = 0; i < rows; i++) {
for (int j = 0; j < cols; j++) {
tran[j][i] = contents[i][j];
}
}
return tran;
}
// Threaded Matrix Multiplication
template <class type>
void matrixParallel(const Matrix<type>* a, const Matrix<type>* b, int numThreads, int currentThread, Matrix<type>* c) {
for (int i = currentThread; i < a->getRows(); i+=numThreads) {
for (int j = 0; j < b->getRows(); j++) {
type result = 0;
for (int k = 0; k < a->getCols(); k++) {
result += ((*a)[i][k]*(*b)[j][k]);
}
(*c)[i][j] = result;
}
}
}
// Matrix Multiplication
template <class type>
Matrix<type> Matrix<type>::multiply(const Matrix<type> &m1, const Matrix<type> &m2) {
if (m1.getCols() != m2.getRows()) {
cout << "Error: Cannot Multiply Matrices of Dimensions ";
cout << "(" << m1.getRows() << "x" << m1.getCols() << ")*";
cout << "(" << m2.getRows() << "x" << m2.getCols() << ")" << endl;
cout << " (Must be in the form (MxN)*(NxP)" << endl;
return Matrix<type>();
}
// Parallel Method
Matrix<type> m2t = m2.transpose();
Matrix<type> multiply(m1.getRows(), m2.getCols());
int numCPU = thread::hardware_concurrency();
thread* threads = new thread[numCPU];
const Matrix<type>* m1Pointer = &m1;
const Matrix<type>* m2tPointer = &m2t;
Matrix<type>* multiplyPointer = &multiply;
for (int i = 0; i < numCPU; i++) {
threads[i] = thread(matrixParallel<type>, m1Pointer, m2tPointer, numCPU, i, multiplyPointer);
}
for (int i = 0; i < numCPU; i++) {
threads[i].join();
}
delete[] threads;
return multiply;
}
(注意,使用编译器选项-std=c++11
在Cygwin中编译)
我朋友对operator*
的实现是一种解决方法,因为我发现我在模板化类上使用运算符的两个选择是使用一系列前向声明(我无法工作,并且不太喜欢格式),或者在运营商的第一个声明中实现整个操作;我最后通过使用一个虚拟调用对象并调用另一个函数来执行所需操作,同时仍然在同一类型下进行模板化,从而以更干净的方式完成第二个。
我不明白为什么我没有看到这种方法接近线性加速,因为每个线程只根据可用内核的数量完成一小部分工作。以下是我使用此操作在不同大小的矩阵上执行的一些基准数据:
/*
Single Thread
==========================================
5 Variables
------------------------------------------
100x5 * 5x5: 0.000157889 seconds
1000x5 * 5x5: 0.0010768 seconds
10000x5 * 5x5: 0.010099 seconds
100000x5 * 5x5: 0.112081 seconds
1000000x5 * 5x5: 1.04285 seconds
10 Variables
------------------------------------------
100x10 * 10x10: 0.000224202 seconds
1000x10 * 10x10: 0.00217571 seconds
10000x10 * 10x10: 0.0201944 seconds
100000x10 * 10x10: 0.203912 seconds
1000000x10 * 10x10: 2.04127 seconds
15 Variables
------------------------------------------
100x15 * 15x15: 0.000408143 seconds
1000x15 * 15x15: 0.00398906 seconds
10000x15 * 15x15: 0.0379782 seconds
100000x15 * 15x15: 0.381156 seconds
1000000x15 * 15x15: 3.81325 seconds
20 Variables
------------------------------------------
100x20 * 20x20: 0.000640239 seconds
1000x20 * 20x20: 0.00620069 seconds
10000x20 * 20x20: 0.060218 seconds
100000x20 * 20x20: 0.602554 seconds
1000000x20 * 20x20: 6.00925 seconds
2 Threads
==========================================
5 Variables
------------------------------------------
100x5 * 5x5: 0.000444063 seconds
1000x5 * 5x5: 0.00119759 seconds
10000x5 * 5x5: 0.00975319 seconds
100000x5 * 5x5: 0.09157 seconds
1000000x5 * 5x5: 0.965666 seconds
10 Variables
------------------------------------------
100x10 * 10x10: 0.000593268 seconds
1000x10 * 10x10: 0.00187927 seconds
10000x10 * 10x10: 0.0154861 seconds
100000x10 * 10x10: 0.161186 seconds
1000000x10 * 10x10: 1.5725 seconds
15 Variables
------------------------------------------
100x15 * 15x15: 0.000651292 seconds
1000x15 * 15x15: 0.00425471 seconds
10000x15 * 15x15: 0.0233983 seconds
100000x15 * 15x15: 0.232411 seconds
1000000x15 * 15x15: 2.43293 seconds
20 Variables
------------------------------------------
100x20 * 20x20: 0.000771287 seconds
1000x20 * 20x20: 0.0045547 seconds
10000x20 * 20x20: 0.0342536 seconds
100000x20 * 20x20: 0.381612 seconds
1000000x20 * 20x20: 3.79707 seconds
4 Threads
==========================================
5 Variables
------------------------------------------
100x5 * 5x5: 0.000690369 seconds
1000x5 * 5x5: 0.00120864 seconds
10000x5 * 5x5: 0.00994858 seconds
100000x5 * 5x5: 0.102673 seconds
1000000x5 * 5x5: 0.907731 seconds
10 Variables
------------------------------------------
100x10 * 10x10: 0.000896809 seconds
1000x10 * 10x10: 0.00287674 seconds
10000x10 * 10x10: 0.0177846 seconds
100000x10 * 10x10: 0.161331 seconds
1000000x10 * 10x10: 1.46384 seconds
15 Variables
------------------------------------------
100x15 * 15x15: 0.00100457 seconds
1000x15 * 15x15: 0.00366381 seconds
10000x15 * 15x15: 0.0291613 seconds
100000x15 * 15x15: 0.237525 seconds
1000000x15 * 15x15: 2.23676 seconds
20 Variables
------------------------------------------
100x20 * 20x20: 0.000928781 seconds
1000x20 * 20x20: 0.00486535 seconds
10000x20 * 20x20: 0.0421105 seconds
100000x20 * 20x20: 0.354478 seconds
1000000x20 * 20x20: 3.22576 seconds
*/
任何人都可以提供任何有关可能导致此问题的见解吗?任何评论将不胜感激。
答案 0 :(得分:2)
您所看到的是线程的昂贵程度。这就是为什么大多数现代框架甚至不使用线程,他们使用线程池,已经分配的线程列表等待工作。大多数还为.NET提供理论支持,但是实际的支持从未实现,因为线程池方法证明是足够的。
结合您在线程中执行的少数操作,上下文切换和各种样板文件,您根本不会看到对您的方法有多大好处。事实上,大多数矩阵库(Direct3D&#39; s等)使用SIMD(如果我没记错的话,SSE2),它没有上下文切换惩罚和更少的设置时间(只有两个movq
)。
我认为这是一个很酷的学习练习(虽然没有同步),但考虑到其他选择,这总体上是不切实际的。