Question

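I am asking whether bitwise operations can considerably speed up integer matrix multiplication. The matrices are small, and the elements are small nonnegative integers (small means at most 20).

To keep us focused, let's be extremely specific and say that I have two 3x3 matrices with integer entries 0 <= x < 15.

The following naive C++ implementation, executed a million times, takes around 1 second, measured with the Linux time command.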

#include <random>

int main() {
    // Random number generator
    std::random_device rd;
    std::mt19937 eng(rd());
    // Note: both bounds are inclusive, so this draws values from 0..15
    std::uniform_int_distribution<> distr(0, 15);

    int A[3][3];
    int B[3][3];
    int C[3][3];
    // Run exactly one million trials
    for (int trials = 0; trials < 1000000; trials++) {
        // Set up A[] and B[], and zero C[]
        for (int i = 0; i < 3; ++i) {
            for (int j = 0; j < 3; ++j) {
                A[i][j] = distr(eng);
                B[i][j] = distr(eng);
                C[i][j] = 0;
            }
        }
        // Compute C[] = A[] * B[]
        for (int i = 0; i < 3; ++i) {
            for (int j = 0; j < 3; ++j) {
                for (int k = 0; k < 3; ++k) {
                    C[i][j] = C[i][j] + A[i][k] * B[k][j];
                }
            }
        }
    }
    return 0;
}

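Notes:

The matrices are not necessarily sparse.
Strassen-like comments do not help here.
Let's try not to use the circumstantial observation that, in this specific problem, the matrices A[] and B[] can each be encoded as a single 64-bit integer. Think about what would happen with just slightly larger matrices.
The computation is single-threaded.

Related: Binary matrix multiplication bit twiddling hack and What is the optimal algorithm for the game 2048?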

Answer 1

The question you linked is about a matrix where every element is a single bit. For one-bit values a and b, a * b is exactly equivalent to a & b.
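To illustrate why the one-bit case is so cheap, here is a minimal sketch. It assumes each vector of up to 64 one-bit elements is packed into a std::uint64_t, one element per bit; the name dot_bits is made up for this example. Because the product of two bits is their AND, the integer dot product reduces to counting the set bits of a & b.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>

// Hypothetical helper: dot product of two vectors of one-bit elements,
// packed one element per bit. a * b == a & b for single bits, so the
// sum of the products is the population count of (a & b).
inline unsigned dot_bits(std::uint64_t a, std::uint64_t b) {
    return static_cast<unsigned>(std::bitset<64>(a & b).count());
}

// Usage example.
inline void dot_bits_example() {
    assert(dot_bits(0xBu, 0x6u) == 1);    // 1011 & 0110 = 0010: one common bit
    assert(dot_bits(0xFFu, 0x0Fu) == 4);  // four common bits
}
```

None of this carries over directly to multi-bit elements, which is the point of the rest of this answer.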

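For adding 2-bit elements, it might be plausible (and faster than unpacking) to build the add basically from scratch: XOR gives a carryless add, then AND generates the carries, which you shift and mask so that no carry crosses an element boundary.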
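A minimal sketch of that 2-bit add, assuming 32 two-bit elements packed into a std::uint64_t and sums that wrap modulo 4 within each element. The names LOW_BITS and add_2bit_fields are made up for this example.

```cpp
#include <cassert>
#include <cstdint>

// Low bit of every 2-bit field.
constexpr std::uint64_t LOW_BITS = 0x5555555555555555ull;

// Adds 32 packed 2-bit elements at once; each field wraps modulo 4.
inline std::uint64_t add_2bit_fields(std::uint64_t a, std::uint64_t b) {
    const std::uint64_t sum = a ^ b;                // carryless add of every bit
    const std::uint64_t carry = (a & b) & LOW_BITS; // keep only carries out of low bits
    return sum ^ (carry << 1);                      // feed them into the high bit of the same field
}

// Usage example: fields (high to low) 3,1,2,1 plus 1,1,3,2 give 0,2,1,3.
inline void add_2bit_fields_example() {
    assert(add_2bit_fields(0xD9u, 0x5Eu) == 0x27u);
}
```

Carries out of the high bit of a field are dropped by the mask, so neighbouring elements never see them.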

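A third bit would require detecting when adding the carry produces yet another carry. I don't think emulating even a 3-bit adder or multiplier would be a win compared to using SIMD. Without SIMD (i.e. in pure C with uint64_t) it might make sense. For addition, you could try a normal add and then undo the carries across element boundaries, instead of building an adder yourself out of XOR / AND / shift operations.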
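One well-known way to get the effect of "normal add, no carries across boundaries" is sketched below. It is a variation on the idea above rather than a literal undo: the top bit of each element is masked out before the add so carries cannot escape, then folded back in with XOR. The sketch assumes 16 four-bit elements per std::uint64_t and sums that wrap modulo 16; HIGH_BITS and add_nibbles are names made up here.

```cpp
#include <cassert>
#include <cstdint>

// Top bit of every 4-bit field.
constexpr std::uint64_t HIGH_BITS = 0x8888888888888888ull;

// Adds 16 packed 4-bit elements at once; each field wraps modulo 16.
inline std::uint64_t add_nibbles(std::uint64_t a, std::uint64_t b) {
    // Ordinary add on the low 3 bits of each field: a carry can reach the
    // (cleared) top bit of its own field but can never leave the field.
    const std::uint64_t low = (a & ~HIGH_BITS) + (b & ~HIGH_BITS);
    // Fold the original top bits in with XOR, discarding the carry out.
    return low ^ ((a ^ b) & HIGH_BITS);
}

// Usage example: fields 9,F,3 plus 8,1,2 give 1,0,5 (17 and 16 wrap).
inline void add_nibbles_example() {
    assert(add_nibbles(0x9F3u, 0x812u) == 0x105u);
}
```

Note that a 3x3 dot product of values up to 15 can reach 675, so real accumulators would need wider fields than the inputs; this only shows the carry-containment trick.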

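Packed vs. unpacked-to-bytes storage formats

If you have very many of these tiny matrices, storing them in memory in compressed form (e.g. packed 4-bit elements) can help with cache footprint and memory bandwidth. 4-bit elements are fairly easy to unpack so that each element ends up in a separate byte element of a vector.

Otherwise, store them with one matrix element per byte. From there, you can easily unpack them to 16 or 32 bits per element if needed, depending on which element sizes the target SIMD instruction set provides. You might keep some matrices in local variables in unpacked format to reuse across multiplies, but pack them back into 4 bits per element for storage in an array.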
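As a rough scalar illustration of the packed storage format, the sketch below stores a 3x3 matrix of 4-bit elements in one std::uint64_t (9 x 4 = 36 bits) and unpacks it to one element per byte. The layout (row-major, element i in bits 4i..4i+3) and the names Matrix3, pack4 and unpack4 are assumptions made for this example; a SIMD version would do the unpacking with shifts, masks and shuffles instead of a loop.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Unpacked format: row-major 3x3 matrix, one element per byte.
using Matrix3 = std::array<std::uint8_t, 9>;

// Packs nine 4-bit elements into the low 36 bits of a 64-bit word.
inline std::uint64_t pack4(const Matrix3& m) {
    std::uint64_t packed = 0;
    for (std::size_t i = 0; i < m.size(); ++i) {
        assert(m[i] < 16);  // every element must fit in 4 bits
        packed |= static_cast<std::uint64_t>(m[i]) << (4 * i);
    }
    return packed;
}

// Unpacks back to one element per byte.
inline Matrix3 unpack4(std::uint64_t packed) {
    Matrix3 m{};
    for (std::size_t i = 0; i < m.size(); ++i) {
        m[i] = static_cast<std::uint8_t>((packed >> (4 * i)) & 0xF);
    }
    return m;
}
```

With this layout an array of packed matrices takes 8 bytes per matrix, versus 9 bytes unpacked to bytes or 36 bytes as int.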

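Compilers do a poor job of this with uint8_t in scalar C code for x86. See the comments on @Richard's answer: gcc and clang both like to use mul r8 for uint8_t, which forces them to move data into eax (the implicit input/output of a one-operand multiply), rather than using imul r32, r32 and ignoring the garbage that leaves outside the low 8 bits of the destination register.

The uint8_t version actually runs slower than the uint16_t version, even though it has half the cache footprint.

You're probably going to get the best results from some kind of SIMD.

Intel SSSE3 has a vector byte multiply, but only with adding of adjacent elements. Using it would require unpacking your matrix into a vector with some zeros between rows, so that data from one row doesn't get mixed with data from another row. Fortunately, pshufb can zero elements as well as copy them around.

More likely to be useful is SSE2 PMADDWD, if you unpack each matrix element into a separate 16-bit vector element. Given a row in one vector and a transposed column in another vector, pmaddwd (_mm_madd_epi16) is one horizontal add away from giving you the dot-product result you need for C[i][j].

Instead of doing each of those adds separately, you probably want to pack multiple pmaddwd results into a single vector so you can store C[i][0..2] in one go.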
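Here is a minimal sketch of the pmaddwd idea for one row of C, assuming elements already unpacked to 16 bits and B available in transposed form. It does the simple version (one horizontal add per element) and not the packed-results refinement just described. The names dot3, row_times_matrix and the macro SKETCH_HAVE_SSE2 are made up for this example; a scalar fallback is included so the code also builds where SSE2 is unavailable.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2 intrinsics
#define SKETCH_HAVE_SSE2 1
#endif

// Dot product of a row of A with a column of B (given as a row of B transposed).
inline std::int32_t dot3(const std::int16_t* row, const std::int16_t* col) {
#ifdef SKETCH_HAVE_SSE2
    // 16-bit lanes 3..7 are zero padding, so they contribute nothing.
    const __m128i r = _mm_set_epi16(0, 0, 0, 0, 0, row[2], row[1], row[0]);
    const __m128i c = _mm_set_epi16(0, 0, 0, 0, 0, col[2], col[1], col[0]);
    // pmaddwd: 32-bit lane 0 = r0*c0 + r1*c1, lane 1 = r2*c2 + 0.
    const __m128i prod = _mm_madd_epi16(r, c);
    // The one remaining horizontal add: lane 0 + lane 1.
    const __m128i sum = _mm_add_epi32(prod, _mm_srli_si128(prod, 4));
    return _mm_cvtsi128_si32(sum);
#else
    return row[0] * col[0] + row[1] * col[1] + row[2] * col[2];
#endif
}

// Computes C[i][0..2] from A[i][0..2] and bt, where bt[j][k] == B[k][j].
inline void row_times_matrix(const std::int16_t a_row[3],
                             const std::int16_t bt[3][3],
                             std::int32_t c_row[3]) {
    for (int j = 0; j < 3; ++j) {
        c_row[j] = dot3(a_row, bt[j]);
    }
}
```

In real code you would load the rows with vector loads rather than _mm_set_epi16, and keep the transposed B in registers across all three rows of A.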

Answer 2

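If you are performing this calculation over a large number of matrices, you may find that reducing the data size gives a significant performance improvement: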

#include <cstdint>
#include <cstdlib>

using T = std::uint_fast8_t;

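// C must be zero-initialised by the caller; this accumulates into it.
// Note: with entries up to 15 a dot product can reach 3 * 15 * 15 = 675,
// so results wrap modulo 256 wherever T is only 8 bits wide.
void mpy(T A[3][3], T B[3][3], T C[3][3])
{
    for (int i = 0; i < 3; ++i) {
        for (int j = 0; j < 3; ++j) {
            for (int k = 0; k < 3; ++k) {
                C[i][j] = C[i][j] + A[i][k] * B[k][j];
            }
        }
    }
}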

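The Pentium can move and sign-extend an 8-bit value in a single instruction. With one byte per element instead of a 4-byte int, you get 4 times as many matrices per cache line.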
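The factor of 4 is just the ratio of element sizes. A small compile-time check, assuming the baseline is a 32-bit element as in the question's int arrays (matrix_bytes is a name made up here):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Storage needed for one 3x3 matrix with elements of type T.
template <class T>
constexpr std::size_t matrix_bytes() {
    return 3 * 3 * sizeof(T);
}

// 9 bytes per matrix with 8-bit elements, 36 bytes with 32-bit elements.
static_assert(matrix_bytes<std::uint8_t>() == 9, "one byte per element");
static_assert(matrix_bytes<std::uint32_t>() == 4 * matrix_bytes<std::uint8_t>(),
              "32-bit elements take four times the space");
```

Keep in mind that std::uint_fast8_t is only guaranteed to be at least 8 bits wide, so on some platforms it may not shrink the data at all; std::uint8_t is the exact-width choice.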

Update: curiosity piqued, I wrote a test:

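#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <initializer_list>
#include <iostream>
#include <iterator>
#include <memory>
#include <random>
#include <typeinfo>
#include <utility>
#include <vector>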

template<class T>
struct matrix
{
    static constexpr std::size_t rows = 3;
    static constexpr std::size_t cols = 3;
    static constexpr std::size_t size() { return rows * cols; }

    template<class Engine, class U>
    matrix(Engine& engine, std::uniform_int_distribution<U>& dist)
    : matrix(std::make_index_sequence<size()>(), engine, dist)
    {}

    template<class U>
    matrix(std::initializer_list<U> li)
    : matrix(std::make_index_sequence<size()>(), li)
    {

    }

    matrix()
    : _data { 0 }
    {}

    const T* operator[](std::size_t i) const {
        return std::addressof(_data[i * cols]);
    }

    T* operator[](std::size_t i) {
        return std::addressof(_data[i * cols]);
    }

private:

    template<std::size_t...Is, class U, class Engine>
    matrix(std::index_sequence<Is...>, Engine& eng, std::uniform_int_distribution<U>& dist)
    : _data { (void(Is), dist(eng))... }
    {}

    template<std::size_t...Is, class U>
    matrix(std::index_sequence<Is...>, std::initializer_list<U> li)
    : _data { ((Is < li.size()) ? *(li.begin() + Is) : 0)... }
    {}


    T _data[rows * cols];
};

template<class T>
matrix<T> operator*(const matrix<T>& A, const matrix<T>& B)
{
    matrix<T> C;
    for (int i = 0; i < 3; ++i) {
        for (int j = 0; j < 3; ++j) {
            for (int k = 0; k < 3; ++k) {
                C[i][j] = C[i][j] + A[i][k] * B[k][j];
            }
        }
    }
    return C;
}

static constexpr std::size_t test_size = 1000000;
template<class T, class Engine>
void fill(std::vector<matrix<T>>& v, Engine& eng, std::uniform_int_distribution<T>& dist)
{
    v.clear();
    v.reserve(test_size);
    generate_n(std::back_inserter(v), test_size,
               [&] { return matrix<T>(eng, dist); });
}

template<class T>
void test(std::random_device& rd)
{
    std::mt19937 eng(rd());
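    // Caveat: the standard only specifies std::uniform_int_distribution for
    // short, int, long, long long and their unsigned forms, so the uint8_t
    // instantiation relies on the library accepting it.
    std::uniform_int_distribution<T> distr(0, 15);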

    std::vector<matrix<T>> As, Bs, Cs;
    fill(As, eng, distr);
    fill(Bs, eng, distr);
    fill(Cs, eng, distr);

    auto start = std::chrono::high_resolution_clock::now();
    auto ia = As.cbegin();
    auto ib = Bs.cbegin();
    for (auto&m : Cs)
    {
        m = *ia++ * *ib++;
    }
    auto stop = std::chrono::high_resolution_clock::now();

    auto diff = stop - start;
    // The duration is measured in microseconds, so name the variable accordingly
    auto micros = std::chrono::duration_cast<std::chrono::microseconds>(diff).count();
    std::cout << "for type " << typeid(T).name() << " time is " << micros << "us" << std::endl;

}

int main() {
    //Random number generator
    std::random_device rd;
    test<std::uint64_t>(rd);
    test<std::uint32_t>(rd);
    test<std::uint16_t>(rd);
    test<std::uint8_t>(rd);
}

Example output (recent MacBook Pro, 64-bit, compiled with -O3):

for type y time is 32787us
for type j time is 15323us
for type t time is 14347us
for type h time is 31550us

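Summary:

On this platform, int32 and int16 proved to be as fast as each other. int64 and int8 were equally slow (the 8-bit result surprised me).

Conclusion:

As ever, express intent to the compiler and let the optimiser do its thing. If the program runs too slowly in production, take measurements and optimise the worst offenders.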
