I have the following hot spot in my code: I need to perform about 1,000,000,000 lookups in a 64-byte array.
Minimal code:
#include <iostream>
#include <stdint.h>
#include <random>
#include <chrono>
#include <ctime>
#define TYPE uint8_t
#define n_lookup 64
int main(){
const int n_indices = 1000000;
TYPE lookup[n_lookup];
TYPE indices[n_indices];
TYPE result[n_indices];
// preparations
std::default_random_engine generator;
std::uniform_int_distribution<int> distribution(0, n_lookup);
for (int i=0; i < n_indices; i++) indices[i] = distribution(generator);
for (int i=0; i < n_lookup; i++) lookup[i] = distribution(generator);
std::chrono::time_point<std::chrono::system_clock> start = std::chrono::system_clock::now();
// main loop:
for (int i=0; i < n_indices; i++) {
result[i] = lookup[indices[i]];
}
std::chrono::time_point<std::chrono::system_clock> end = std::chrono::system_clock::now();
std::chrono::duration<double> elapsed_seconds = end - start;
std::cout << "computation took " << elapsed_seconds.count() * 1e9 / n_indices << " ns per element"<< std::endl;
// printing random numbers to avoid code elimination
std::cout << result[12] << result[45];
return 0;
}
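Compiled with g++ lookup.cpp -std=gnu++11 -O3 -funroll-loops
this takes a bit less than 1 ns per element on a modern CPU.
I need this operation to run 2-3 times faster (without threads). How can I do that?
P.S. I have also been looking into the AVX512 instruction set (512 bits is exactly the size of the lookup table!), but it lacks 8-bit gather operations!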
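Answer 0: (score: 3)
The indices and result vectors live at different places in memory but are accessed at the same time, which leads to cache misses. I suggest merging result and indices into a single vector. Here is the code: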
#include <iostream>
#include <stdint.h>
#include <random>
#include <chrono>
#include <ctime>
#define TYPE uint8_t
#define n_lookup 64
int main(){
const int n_indices = 2000000;
TYPE lookup[n_lookup];
// Merge indices and result
// If i is index, then i+1 is result
TYPE ind_res[n_indices];
// preparations
std::default_random_engine generator;
// both bounds are inclusive, so the largest valid index is n_lookup - 1
std::uniform_int_distribution<int> distribution(0, n_lookup - 1);
for (int i=0; i < n_indices; i += 2) ind_res[i] = distribution(generator);
for (int i=0; i < n_lookup; i++) lookup[i] = distribution(generator);
std::chrono::time_point<std::chrono::system_clock> start = std::chrono::system_clock::now();
// main loop:
for (int i=0; i < n_indices; i += 2) {
ind_res[i+1] = lookup[ind_res[i]]; // more dense access here, no cache-miss
}
std::chrono::time_point<std::chrono::system_clock> end = std::chrono::system_clock::now();
std::chrono::duration<double> elapsed_seconds = end - start;
// only every second slot holds an index, so n_indices / 2 lookups were performed
std::cout << "computation took " << elapsed_seconds.count() * 1e9 / (n_indices / 2) << " ns per element"<< std::endl;
// printing random numbers to avoid code elimination
std::cout << ind_res[24] << ind_res[90];
return 0;
}
My tests show that this code runs faster.
Answer 1: (score: 2)
With -march=native, this is what your loop compiles to:
movq %rax, %rbx
xorl %eax, %eax
.L145:
movzbl 128(%rsp,%rax), %edx
movzbl 64(%rsp,%rdx), %edx
movb %dl, 1000128(%rsp,%rax)
addq $1, %rax
cmpq $1000000, %rax
jne .L145
I have a hard time seeing how this could be made faster without parallelization.
If TYPE is changed to int32_t, the loop gets vectorized:
vpcmpeqd %ymm2, %ymm2, %ymm2
movq %rax, %rbx
xorl %eax, %eax
.L145:
vmovdqa -8000048(%rbp,%rax), %ymm1
vmovdqa %ymm2, %ymm3
vpgatherdd %ymm3, -8000304(%rbp,%ymm1,4), %ymm0
vmovdqa %ymm0, -4000048(%rbp,%rax)
addq $32, %rax
cmpq $4000000, %rax
jne .L145
vzeroupper
Might that help?
Answer 2: (score: 2)
First, there is a bug: distribution(0, 64) produces the numbers 0 through 64, and 64 is not a valid index into the array.
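To illustrate the point: std::uniform_int_distribution<int>(a, b) draws from the closed interval [a, b], so the upper bound has to be the table size minus one. A minimal sketch, where the helper name make_indices is made up for this example:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper: returns `count` random indices that are valid for a
// table with `table_size` entries, i.e. values in [0, table_size - 1].
std::vector<std::uint8_t> make_indices(std::size_t count, int table_size) {
    assert(table_size > 0 && table_size <= 256);  // indices must fit in a byte
    std::default_random_engine generator;
    // Both bounds are inclusive, hence table_size - 1 rather than table_size.
    std::uniform_int_distribution<int> distribution(0, table_size - 1);
    std::vector<std::uint8_t> indices(count);
    for (auto& idx : indices) {
        idx = static_cast<std::uint8_t>(distribution(generator));
    }
    return indices;
}
```

Calling make_indices(1000000, 64) yields only values from 0 to 63, so lookup[indices[i]] never reads past the end of a 64-entry array.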
You can speed up the lookup by looking up two values at a time:
#include <iostream>
#include <stdint.h>
#include <random>
#include <chrono>
#include <ctime>
#define TYPE uint8_t
#define TYPE2 uint16_t
#define n_lookup 64
void tst() {
const int n_indices = 1000000;// has to be multiple of 2
TYPE lookup[n_lookup];
TYPE indices[n_indices];
TYPE result[n_indices];
TYPE2 lookup2[n_lookup * 256];
// preparations
std::default_random_engine generator;
std::uniform_int_distribution<int> distribution(0, n_lookup-1);
for (int i = 0; i < n_indices; i++) indices[i] = distribution(generator);
for (int i = 0; i < n_lookup; i++) lookup[i] = distribution(generator);
for (int i = 0; i < n_lookup; ++i) {
for (int j = 0; j < n_lookup; ++j) {
lookup2[(i << 8) | j] = (lookup[i] << 8) | lookup[j];
}
}
std::chrono::time_point<std::chrono::system_clock> start = std::chrono::system_clock::now();
TYPE2* indices2 = (TYPE2*)indices;
TYPE2* result2 = (TYPE2*)result;
// main loop:
for (int i = 0; i < n_indices / 2; ++i) {
*result2++ = lookup2[*indices2++];
}
std::chrono::time_point<std::chrono::system_clock> end = std::chrono::system_clock::now();
for (int i = 0; i < n_indices; i++) {
if (result[i] != lookup[indices[i]]) {
std::cout << "!!!!!!!!!!!!!ERROR!!!!!!!!!!!!!";
}
}
std::chrono::duration<double> elapsed_seconds = end - start;
std::cout << "computation took " << elapsed_seconds.count() * 1e9 / n_indices << " ns per element" << std::endl;
// printing random numbers to avoid code elimination
std::cout << result[12] << result[45];
}
int main() {
tst();
std::cin.get();
return 0;
}
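The code above reads and writes the byte arrays through uint16_t pointers, which relies on suitable alignment, on the compiler tolerating the aliasing, and on little-endian byte order for the way the table is filled. As a sketch of the same pair-lookup idea without those assumptions, the version below moves the two bytes with std::memcpy both when building the table and when using it. The function names build_pair_table and lookup_pairs are made up for this example, and its speed has not been measured here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Builds a 16-bit table mapping a pair of byte indices to a pair of results.
// Assumes `lookup` has 64 entries and every index is < 64.
std::vector<std::uint16_t> build_pair_table(const std::uint8_t* lookup) {
    std::vector<std::uint16_t> table(64 * 256, 0);
    for (int a = 0; a < 64; ++a) {
        for (int b = 0; b < 64; ++b) {
            const std::uint8_t in[2] = {static_cast<std::uint8_t>(a),
                                        static_cast<std::uint8_t>(b)};
            const std::uint8_t out[2] = {lookup[a], lookup[b]};
            std::uint16_t key, value;
            // Keys and values are formed from their in-memory bytes, so the
            // table matches whatever byte order the machine uses.
            std::memcpy(&key, in, 2);
            std::memcpy(&value, out, 2);
            table[key] = value;
        }
    }
    return table;
}

// Looks up n bytes, two at a time. n has to be a multiple of 2.
void lookup_pairs(const std::vector<std::uint16_t>& table,
                  const std::uint8_t* indices, std::uint8_t* result,
                  std::size_t n) {
    assert(n % 2 == 0);
    for (std::size_t i = 0; i < n; i += 2) {
        std::uint16_t key;
        std::memcpy(&key, indices + i, 2);    // load two indices at once
        const std::uint16_t value = table[key];
        std::memcpy(result + i, &value, 2);   // store two results at once
    }
}
```

Compilers typically turn these fixed-size memcpy calls into single 16-bit loads and stores, so the inner loop should look much like the pointer-cast version.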
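Answer 3: (score: 0)
Your code is already very fast. However, (on my system) execution is about 4.858% faster when you change
const int n_indices = 1000000;
to
const int n_indices = 1048576; // 2^20
That is not much, but it is something.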