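Hi, suppose you have two independent 64-bit binary matrices A and T (T is stored in transposed form; working with the transposed matrix lets the multiplication operate on the rows of T rather than on columns, which is very convenient for binary arithmetic) and you want to multiply them. The only catch is that the result is truncated to 64 bits: if a cell of the product would hold a value of 1 or more, the resulting cell contains 1, otherwise 0.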
A        T
00000001 01111101
01010100 01100101
10010111 00010100
10110000 00011000 <-- This matrix is transposed
11000100 00111110
10000011 10101111
11110101 11000100
10100000 01100010
Binary and traditional multiplication results:
Binary   Traditional
11000100 11000100
11111111 32212121
11111111 32213421
11111111 21112211
11101111 22101231
11001111 11001311
11111111 54213432
11001111 11001211
What is the most efficient way to multiply these matrices in the manner described above?
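To pin down the operation being asked about, here is a minimal reference sketch in C. The bit layout is an assumption that the post does not state explicitly: byte i of each 64-bit word is row i, and bit k of that byte is column k. The names `bmm_row` and `bmm_reference` are made up for this illustration.

```c
#include <stdint.h>

/* Assumed layout: byte i of the word is row i, bit k of that byte is column k. */
uint8_t bmm_row(uint64_t m, int i) {
    return (uint8_t)(m >> (8 * i));
}

/* Reference product over Boolean algebra: result[i][j] is 1 exactly when
   row i of A and row j of T (the transposed right-hand matrix) share at
   least one set bit. */
uint64_t bmm_reference(uint64_t a, uint64_t t) {
    uint64_t r = 0;
    for (int i = 0; i < 8; ++i) {
        for (int j = 0; j < 8; ++j) {
            if (bmm_row(a, i) & bmm_row(t, j)) {
                r |= UINT64_C(1) << (8 * i + j);
            }
        }
    }
    return r;
}
```

Reading each printed row of the example as a binary number gives A = 0xA0F583C4B0975401 and T = 0x62C4AF3E1814657D. Under the assumed layout the function returns 0xF3FFF3F7FFFFFF23, which is the Binary table above with column 0 printed leftmost.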
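I tried to take advantage of the binary AND (i.e. the & operator) instead of multiplying individual bits. In that case I have to prepare the data for the multiplication: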
ulong u;
u = T & 0xFF;
u = (u << 00) + (u << 08) + (u << 16) + (u << 24)
+ (u << 32) + (u << 40) + (u << 48) + (u << 56);
Now a binary AND of the two integers A and u yields the following:
A        u        R        C
00000001 01111101 00000001 1
01010100 01111101 01010100 3
10010111 01111101 00010101 3
10110000 01111101 00110000 2
11000100 01111101 01000100 2
10000011 01111101 00000001 1
11110101 01111101 01110101 5
10100000 01111101 00100000 1
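In the example above, R contains the result of multiplying the bits of A by the bits of u, and to obtain the final value we have to sum all the bits in each row. Note that column C contains the same values as the first column of the Traditional matrix product above. The problem is that at this step I have to operate on individual bits, which I think is a suboptimal approach. I have read through http://graphics.stanford.edu/~seander/bithacks.html looking for a way to do this in parallel, but with no luck. If anyone has an idea of how to "flatten" and "merge" the values in column R into the resulting 64-bit matrix, I would appreciate it if you dropped me a few lines.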
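As an illustration of the step described above, the sketch below does the broadcast, the AND and the per-row bit count on all eight rows at once. It uses the usual SWAR per-byte population count, which is a swapped-in technique and not something from the original post. The function names are hypothetical, and byte i of a word is assumed to hold row i.

```c
#include <stdint.h>

/* Copy the low byte into all eight bytes. This matches the shift-and-add
   sequence in the question, because the partial sums never overlap. */
uint64_t broadcast_row(uint64_t t_row) {
    return (t_row & 0xFF) * UINT64_C(0x0101010101010101);
}

/* Per-byte population count: byte i of the result holds the number of
   set bits in byte i of x. */
uint64_t popcount_bytes(uint64_t x) {
    x = x - ((x >> 1) & UINT64_C(0x5555555555555555));
    x = (x & UINT64_C(0x3333333333333333)) + ((x >> 2) & UINT64_C(0x3333333333333333));
    return (x + (x >> 4)) & UINT64_C(0x0F0F0F0F0F0F0F0F);
}

/* One column of the "Traditional" product, one count per byte. */
uint64_t traditional_column(uint64_t a, uint64_t t_row) {
    return popcount_bytes(a & broadcast_row(t_row));
}
```

For the example A and the first row of T (0x7D), byte i of the result is the value in row i of column C. For the Binary product only a zero or non-zero test per byte is needed, so the full count is more work than strictly required.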
Thanks,
Many thanks to David Eisenstat; the final algorithm looks like this:
var A = ...;
var T = ...; // T == transpose(t), t is original matrix, algorithm works with transposed matrix
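var D = 0x8040201008040201UL; // diagonal mask: bit i of byte i
ulong U;
ulong r = 0; // accumulates the product, one diagonal per line below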
U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & D); T = (T << 8) | (T >> 56); D = (D << 8) | (D >> 56);
U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & D); T = (T << 8) | (T >> 56); D = (D << 8) | (D >> 56);
U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & D); T = (T << 8) | (T >> 56); D = (D << 8) | (D >> 56);
U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & D); T = (T << 8) | (T >> 56); D = (D << 8) | (D >> 56);
U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & D); T = (T << 8) | (T >> 56); D = (D << 8) | (D >> 56);
U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & D); T = (T << 8) | (T >> 56); D = (D << 8) | (D >> 56);
U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & D); T = (T << 8) | (T >> 56); D = (D << 8) | (D >> 56);
U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & D);
The following code:
public static void Main (string[] args){
ulong U;
var Random = new Xor128 ();
var timer = DateTime.Now;
var A = Random.As<IUniformRandom<UInt64>>().Evaluate();
var T = Random.As<IUniformRandom<UInt64>>().Evaluate();
var steps = 10000000;
for (var i = 0; i < steps; i++) {
ulong r = 0;
var d = 0x8040201008040201UL;
U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & d); T = (T << 8) | (T >> 56); d = (d << 8) | (d >> 56);
U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & d); T = (T << 8) | (T >> 56); d = (d << 8) | (d >> 56);
U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & d); T = (T << 8) | (T >> 56); d = (d << 8) | (d >> 56);
U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & d); T = (T << 8) | (T >> 56); d = (d << 8) | (d >> 56);
U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & d); T = (T << 8) | (T >> 56); d = (d << 8) | (d >> 56);
U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & d); T = (T << 8) | (T >> 56); d = (d << 8) | (d >> 56);
U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & d); T = (T << 8) | (T >> 56); d = (d << 8) | (d >> 56);
U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & d);
}
Console.WriteLine (DateTime.Now - timer);
var m1 = new Int32[8,8];
var m2 = new Int32[8,8];
var m3 = new Int32[8,8];
for (int row = 0; row < 8; row++) {
for (int col = 0; col < 8; col++) {
m1 [row, col] = Random.As<IUniformRandom<Int32>> ().Evaluate(0, 1);
m2 [row, col] = Random.As<IUniformRandom<Int32>> ().Evaluate(0, 1);
m3 [row, col] = Random.As<IUniformRandom<Int32>> ().Evaluate(0, 1);
}
}
timer = DateTime.Now;
for (int i = 0; i < steps; i++) {
for (int row = 0; row < 8; row++) {
for (int col = 0; col < 8; col++) {
var sum = 0;
for (int temp = 0; temp < 8; temp++) {
sum += m1 [row, temp] * m2 [temp, col]; // row of m1 times column of m2
}
m3 [row, col] = sum;
}
}
}
Console.WriteLine (DateTime.Now - timer);
}
prints the following results:
00:00:02.4035870
00:00:57.5147150
On Mac OS X / Mono that is a speedup of roughly 24x (57.5 s versus 2.4 s). Thanks, everyone.
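Answer 0 (score: 6)
I'm not sure about most efficient, but here's something to try. The following sequence of instructions computes the main diagonal of the product A * T'. Rotate both T and D by 8 bits and repeat for 7 more iterations.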
// uint64_t A, T;
uint64_t D = UINT64_C(0x8040201008040201);
uint64_t P = A & T;
// test whether each byte is nonzero
P |= P >> 1;
P |= P >> 2;
P |= P >> 4;
P &= UINT64_C(0x0101010101010101);
// fill each nonzero byte with ones
P *= 255; // or P = (P << 8) - P;
// leave only the current diagonal
P &= D;
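Putting the diagonal step and the rotations together, a complete product might look like the sketch below. It only wraps the instruction sequence above in a loop; the names `rotl8` and `bmm_diagonals` are made up here, and the layout (byte i is row i, bit j is column j) is an assumption.

```c
#include <stdint.h>

/* Rotate left by one byte, i.e. move every row down by one position. */
uint64_t rotl8(uint64_t x) {
    return (x << 8) | (x >> 56);
}

/* a * transpose(t) over Boolean algebra, one wrapped diagonal per pass. */
uint64_t bmm_diagonals(uint64_t a, uint64_t t) {
    uint64_t d = UINT64_C(0x8040201008040201);
    uint64_t r = 0;
    for (int k = 0; k < 8; ++k) {
        uint64_t p = a & t;
        /* bit 0 of each byte becomes 1 if the byte is nonzero */
        p |= p >> 1;
        p |= p >> 2;
        p |= p >> 4;
        p &= UINT64_C(0x0101010101010101);
        p *= 255;       /* fill each nonzero byte with ones */
        r |= p & d;     /* keep only the current diagonal */
        t = rotl8(t);   /* pair row i of a with the previous row of t */
        d = rotl8(d);   /* move the mask to the matching diagonal */
    }
    return r;
}
```

After k rotations, byte i of t holds row (i - k) mod 8 and byte i of d selects bit (i - k) mod 8, so each pass writes one wrapped diagonal of the result and the eight passes cover all 64 cells.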
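Answer 1 (score: 2)
If you are looking for a way to do dense matrix multiplication in parallel, partition the result matrix into blocks and compute each block in parallel.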
http://en.wikipedia.org/wiki/Block_matrix#Block_matrix_multiplication
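As a rough sketch of the block idea applied to Boolean matrices, the code below multiplies two 16x16 matrices stored as 2x2 grids of 8x8 blocks. The `bmat16` type and the function names are hypothetical, and the 8x8 kernel is assumed to compute a * transpose(t) with byte i as row i. Each result block depends only on its own inputs, so the four blocks could be computed by separate threads.

```c
#include <stdint.h>

/* 8x8 kernel: a * transpose(t) over Boolean algebra (byte i = row i). */
uint64_t bmm8(uint64_t a, uint64_t t) {
    uint64_t d = UINT64_C(0x8040201008040201);
    uint64_t r = 0;
    for (int k = 0; k < 8; ++k) {
        uint64_t p = a & t;
        p |= p >> 1;
        p |= p >> 2;
        p |= p >> 4;
        p &= UINT64_C(0x0101010101010101);
        r |= (p * 255) & d;
        t = (t << 8) | (t >> 56);
        d = (d << 8) | (d >> 56);
    }
    return r;
}

/* 16x16 Boolean matrix as a 2x2 grid of 8x8 blocks: b[I][J]. */
typedef struct {
    uint64_t b[2][2];
} bmat16;

/* c = a * transpose(t). Block (I, J) of the result is the OR over K of
   bmm8(a.b[I][K], t.b[J][K]); the (I, J) iterations are independent. */
bmat16 bmm16(const bmat16 *a, const bmat16 *t) {
    bmat16 c;
    for (int I = 0; I < 2; ++I) {
        for (int J = 0; J < 2; ++J) {
            uint64_t acc = 0;
            for (int K = 0; K < 2; ++K) {
                acc |= bmm8(a->b[I][K], t->b[J][K]);
            }
            c.b[I][J] = acc;
        }
    }
    return c;
}
```

Because the right-hand operand is already transposed, block (J, K) of t plays the role of block (K, J) of the original matrix, which is why the inner loop indexes both operands by K in the second position.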
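Answer 2 (score: 2)
It is not clear which data structure you are using, which language (yes, I know you said "any language"), or what you want to optimize (speed? memory?). All of these can have a profound impact on your solution.
Some examples:
Use OR (|) instead of +. Some languages may evaluate this lazily and stop at the first 1 encountered.
Answer 3 (score: 1)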
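If you are willing to use lower-level constructs than plain C/C++, SSE/AVX machine instructions together with compiler intrinsics let you write much faster code (4 times faster according to a benchmark I ran). You need to use non-standard vector variables (supported at least by GCC, ICC and Clang):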
using epu = uint8_t __attribute__((vector_size(16)));
I'm using a class such as
class BMat8 {
[...]
private:
uint64_t _data;
};
Then the following code should do what you want:
static constexpr epu rothigh { 0, 1, 2, 3, 4, 5, 6, 7,15, 8, 9,10,11,12,13,14};
static constexpr epu rot2 { 6, 7, 0, 1, 2, 3, 4, 5,14,15, 8, 9,10,11,12,13};
inline BMat8 operator*(BMat8 const& tr) const {
epu x = _mm_set_epi64x(_data, _data);
epu y = _mm_shuffle_epi8(_mm_set_epi64x(tr._data, tr._data), rothigh);
epu data {};
epu diag = {0x01,0x02,0x04,0x08,0x10,0x20,0x40,0x80,
0x80,0x01,0x02,0x04,0x08,0x10,0x20,0x40};
for (int i = 0; i < 4; ++i) {
data |= ((x & y) != epu {}) & diag;
y = _mm_shuffle_epi8(y, rot2);
diag = _mm_shuffle_epi8(diag, rot2);
}
return BMat8(_mm_extract_epi64(data, 0) | _mm_extract_epi64(data, 1));
}
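In particular, using 128-bit registers, I am able to do two iterations at once.
Answer 4 (score: 0)
A solution for strict Boolean algebra can be implemented quite efficiently on x86-64 using the approach I described here: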
https://stackoverflow.com/a/55307540/11147804
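The only difference is that the data from the transposed matrix also needs to be extracted by columns and repacked into rows before each 64-bit product. Fortunately this is trivial to do with the BMI2 instruction for parallel bit extract, accessible in GCC through the intrinsic _pext_u64:
#include <stdint.h>
#include <immintrin.h> // _pext_u64 (BMI2; build with -mbmi2 on GCC)
uint64_t torow (uint64_t c); // defined below
uint64_t mul8x8T (uint64_t A, uint64_t B) {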
const uint64_t COL = 0x0101010101010101;
uint64_t C = 0;
for (int i=0; i<8; ++i) {
uint64_t p = COL & (A>>i); // select column
uint64_t r = torow( COL & (B>>i) );
C |= (p*r); // use ^ for GF(2) instead
}
return C;
}
uint64_t torow (uint64_t c) {
const uint64_t ROW = 0x00000000000000FF; // mask of the first row
const uint64_t COL = 0x0101010101010101; // mask of the first column
// select bits of c in positions marked by COL,
// and pack them consecutively
// last 'and' is included for clarity and is not
// really necessary
return _pext_u64(c, COL) & ROW;
}
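On processors that do not support that particular instruction, one possible solution is to adapt the typical bit-packing trick used, for example, in the classic bit-order reversal of a byte with 64-bit multiplication: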
https://graphics.stanford.edu/~seander/bithacks.html#ReverseByteWith64BitsDiv
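Using masks and integer multiplication by certain constants, you obtain a quadword that contains the packed result as a bit substring, which can then be extracted with a bit shift and a mask.
The idea is to view the multiplication step as a parallel bit shift, where every bit of the input is shifted by a different amount specified in the constant. This is always possible as long as the strides of the two numbers do not collide at some position in the result, i.e. as long as each partial sum of the multiplication updates different bit positions in the result. This avoids any potential carries, which makes the bit sum equivalent to a bit-parallel OR (or XOR).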
uint64_t torow (uint64_t c) {
const uint64_t ROW = 0x00000000000000FF; // select 8 lowest consecutive bits to get the first row
const uint64_t COL = 0x0101010101010101; // select every 8th bit to get the first column
const uint64_t DIA = 0x8040201008040201; // select every 8+1 bit to obtain a diagonal
c *= ROW; // "copies" first column to the rest
c &= DIA; // only use diagonal bits or else there will be position collisions and unexpected carries
c *= COL; // "scatters" every bit to all rows after itself; the last row will now contain the packed bits
return c >> 56; // move last row to first & discard the rest
}
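For completeness, here is how the multiply-based column gather might be combined with the product loop from earlier in this answer, so that the whole routine runs without BMI2. This is only a sketch: `torow_mul` and `mul8x8T_portable` are names invented here to avoid clashing with the two `torow` definitions above.

```c
#include <stdint.h>

/* Gather bits 0, 8, 16, ..., 56 of c into the low byte using two
   multiplications whose partial products never collide. */
uint64_t torow_mul(uint64_t c) {
    c *= UINT64_C(0xFF);                /* fill byte k with bit 8k */
    c &= UINT64_C(0x8040201008040201);  /* keep bit k of byte k */
    c *= UINT64_C(0x0101010101010101);  /* collect the diagonal in the top byte */
    return c >> 56;
}

/* A * transpose(B) over Boolean algebra, one column pair per iteration. */
uint64_t mul8x8T_portable(uint64_t A, uint64_t B) {
    const uint64_t COL = UINT64_C(0x0101010101010101);
    uint64_t C = 0;
    for (int i = 0; i < 8; ++i) {
        uint64_t p = COL & (A >> i);             /* column i of A */
        uint64_t r = torow_mul(COL & (B >> i));  /* column i of B as a row */
        C |= p * r;                              /* outer product of the two */
    }
    return C;
}
```

The product `p * r` cannot carry either: `p` has at most one set bit per byte and `r` fits in one byte, so each set bit of `p` just drops a copy of `r` into its own byte.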
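There are other possible implementations of torow that use more operations of lower strength; which one is fastest will depend on the target architecture.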
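One such variant, sketched here as an assumption rather than as the author's code, replaces both multiplications with three shift-and-OR steps. The name `torow_shift` is hypothetical.

```c
#include <stdint.h>

/* Gather bits 0, 8, 16, ..., 56 of c into the low byte with shifts and ORs.
   Bit 8k has to travel 7k positions to the right, and 7k is always a sum
   of some of the three shift amounts 7, 14 and 28. */
uint64_t torow_shift(uint64_t c) {
    c &= UINT64_C(0x0101010101010101);  /* keep only the first column */
    c |= c >> 7;                        /* bits 0..1 of each byte are now valid */
    c |= c >> 14;                       /* bits 0..3 of each byte are now valid */
    c |= c >> 28;                       /* bits 0..7 of the low byte are now valid */
    return c & 0xFF;
}
```

This trades two multiplications for three shifts, three ORs and two masks, which may or may not be a win depending on multiplier latency on the target.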