I just read this blog post http://lemire.me/blog/archives/2012/06/20/do-not-waste-time-with-stl-vectors/, which compares the performance of operator[] assignment and push_back on a std::vector with pre-reserved memory, and I decided to try it myself. The operations are simple:
// for vector
bigarray.reserve(N);
// START TIME TRACK
for(int k = 0; k < N; ++k)
// for operator[]:
// bigarray[k] = k;
// for push_back
bigarray.push_back(k);
// END TIME TRACK
// do some dummy operations to prevent the compiler from optimizing everything away
long sum = accumulate(begin(bigarray), end(bigarray), 0);
The results are as follows:
~/t/benchmark> icc 1.cpp -O3 -std=c++11
~/t/benchmark> ./a.out
[ 1.cpp: 52] 0.789123s --> C++ new
[ 1.cpp: 52] 0.774049s --> C++ new
[ 1.cpp: 66] 0.351176s --> vector
[ 1.cpp: 80] 1.801294s --> reserve + push_back
[ 1.cpp: 94] 1.753786s --> reserve + emplace_back
[ 1.cpp: 107] 2.815756s --> no reserve + push_back
~/t/benchmark> clang++ 1.cpp -std=c++11 -O3
~/t/benchmark> ./a.out
[ 1.cpp: 52] 0.592318s --> C++ new
[ 1.cpp: 52] 0.566979s --> C++ new
[ 1.cpp: 66] 0.270363s --> vector
[ 1.cpp: 80] 1.763784s --> reserve + push_back
[ 1.cpp: 94] 1.761879s --> reserve + emplace_back
[ 1.cpp: 107] 2.815596s --> no reserve + push_back
~/t/benchmark> g++ 1.cpp -O3 -std=c++11
~/t/benchmark> ./a.out
[ 1.cpp: 52] 0.617995s --> C++ new
[ 1.cpp: 52] 0.601746s --> C++ new
[ 1.cpp: 66] 0.270533s --> vector
[ 1.cpp: 80] 1.766538s --> reserve + push_back
[ 1.cpp: 94] 1.998792s --> reserve + emplace_back
[ 1.cpp: 107] 2.815617s --> no reserve + push_back
For all compilers, the vector with operator[] is much faster than the raw pointer with operator[]. That leads to the first question: why? The second question is: I have already "reserved" the memory, so why is operator[] faster? The next question is: since the memory is already allocated, why is push_back slower than operator[]?
The test code is as follows:
#include <iostream>
#include <iomanip>
#include <vector>
#include <numeric>
#include <chrono>
#include <string>
#include <cstring>
#define PROFILE(BLOCK, ROUTNAME) ProfilerRun([&](){do {BLOCK;} while(0);}, \
ROUTNAME, __FILE__, __LINE__);
template <typename T>
void ProfilerRun (T&& func, const std::string& routine_name = "unknown",
const char* file = "unknown", unsigned line = 0)
{
using std::chrono::duration_cast;
using std::chrono::microseconds;
using std::chrono::steady_clock;
using std::cerr;
using std::endl;
steady_clock::time_point t_begin = steady_clock::now();
// Call the function
func();
steady_clock::time_point t_end = steady_clock::now();
cerr << "[" << std::setw (20)
<< (std::strrchr (file, '/') ?
std::strrchr (file, '/') + 1 : file)
<< ":" << std::setw (5) << line << "] "
<< std::setw (10) << std::setprecision (6) << std::fixed
<< static_cast<float> (duration_cast<microseconds>
(t_end - t_begin).count()) / 1e6
<< "s --> " << routine_name << endl;
cerr.unsetf (std::ios_base::floatfield);
}
using namespace std;
const int N = (1 << 29);
int routine1()
{
int sum;
int* bigarray = new int[N];
PROFILE (
{
for (unsigned int k = 0; k < N; ++k)
bigarray[k] = k;
}, "C++ new");
sum = std::accumulate (bigarray, bigarray + N, 0);
delete [] bigarray;
return sum;
}
int routine2()
{
int sum;
vector<int> bigarray (N);
PROFILE (
{
for (unsigned int k = 0; k < N; ++k)
bigarray[k] = k;
}, "vector");
sum = std::accumulate (begin (bigarray), end (bigarray), 0);
return sum;
}
int routine3()
{
int sum;
vector<int> bigarray;
bigarray.reserve (N);
PROFILE (
{
for (unsigned int k = 0; k < N; ++k)
bigarray.push_back (k);
}, "reserve + push_back");
sum = std::accumulate (begin (bigarray), end (bigarray), 0);
return sum;
}
int routine4()
{
int sum;
vector<int> bigarray;
bigarray.reserve (N);
PROFILE (
{
for (unsigned int k = 0; k < N; ++k)
bigarray.emplace_back(k);
}, "reserve + emplace_back");
sum = std::accumulate (begin (bigarray), end (bigarray), 0);
return sum;
}
int routine5()
{
int sum;
vector<int> bigarray;
PROFILE (
{
for (unsigned int k = 0; k < N; ++k)
bigarray.push_back (k);
}, "no reserve + push_back");
sum = std::accumulate (begin (bigarray), end (bigarray), 0);
return sum;
}
int main()
{
long s0 = routine1();
long s1 = routine1();
long s2 = routine2();
long s3 = routine3();
long s4 = routine4();
long s5 = routine5();
return int (s1 + s2);
}
Answer 0 (score: 43)
push_back checks for available capacity; operator[] does not. So even though you have reserved the space, push_back performs an extra conditional check on every call that operator[] does not. In addition, it increments the size value (reserve only sets the capacity), so size is updated on every call as well.
In short, push_back does more than operator[] does - and that is why it is slower (and more accurate).
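A minimal, runnable sketch of that observable difference (illustrative only, not code from the answer): reserve changes only capacity, every push_back checks the capacity and bumps size, and operator[] is just a store.
#include <cassert>
#include <vector>
int main()
{
    std::vector<int> v;
    v.reserve(8);                 // only capacity() changes, size() stays 0
    assert(v.size() == 0 && v.capacity() >= 8);
    v.push_back(42);              // checks capacity, writes the element, increments size()
    assert(v.size() == 1);
    v[0] = 7;                     // a plain store: no capacity check, no size() update
    assert(v[0] == 7 && v.size() == 1);
    return 0;
}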
Answer 1 (score: 25)
As Yakk and I figured out, there may be another interesting factor contributing to the apparent slowness of push_back.
The first interesting observation is that in the original test, using new and operating on the raw array is slower than using vector<int> bigarray(N); with operator[] - by more than a factor of 2. Even more interesting, you can get the same performance for the raw-array variant by inserting an additional memset:
int routine1_modified()
{
int sum;
int* bigarray = new int[N];
memset(bigarray, 0, sizeof(int)*N);
PROFILE (
{
for (unsigned int k = 0; k < N; ++k)
bigarray[k] = k;
}, "C++ new");
sum = std::accumulate (bigarray, bigarray + N, 0);
delete [] bigarray;
return sum;
}
The conclusion, of course, is that PROFILE measures something other than what was expected. Yakk and I guess it has to do with memory management; from Yakk's comment on the OP:
resize will touch the whole block of memory. reserve will allocate without touching. If you have a lazy allocator that does not fetch or assign physical memory pages until they are accessed, a reserve on an empty vector is almost free (it does not even have to find physical memory for the pages!) until you write to the pages (at which point they must be found).
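That comment restated as a small sketch (an illustration of the assumed lazy-allocator behavior, not code from the answer):
#include <cstddef>
#include <vector>
int main()
{
    const std::size_t N = std::size_t(1) << 28;
    std::vector<int> a;
    a.reserve(N);            // capacity() == N, size() == 0; with a lazy (overcommitting)
                             // allocator the pages behind the buffer need not be backed
                             // by physical memory yet
    std::vector<int> b(N);   // size() == N and every element is zero-initialized, so every
                             // page of this buffer has already been touched (committed)
    return int(a.capacity() >= N && b.size() == N);
}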
I was thinking along similar lines, so I tried a small test of this hypothesis by touching certain pages with a "strided memset" (a profiling tool might give more reliable results):
int routine1_modified2()
{
int sum;
int* bigarray = new int[N];
for(int k = 0; k < N; k += PAGESIZE*2/sizeof(int))
bigarray[k] = 0;
PROFILE (
{
for (unsigned int k = 0; k < N; ++k)
bigarray[k] = k;
}, "C++ new");
sum = std::accumulate (bigarray, bigarray + N, 0);
delete [] bigarray;
return sum;
}
By changing the stride from every half page, to every page, to every other page, to every 4th page, and finally leaving it out entirely, we get a nice transition of timings from the vector<int> bigarray(N); case to the new int[N] case without the memset.
To me this is a strong hint that memory management is the major contributor to the measurements.
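One way to probe this more directly than with timings alone (a sketch assuming a Linux/POSIX system; not part of the original measurements) is to count minor page faults around the write loop with getrusage:
#include <cstdio>
#include <sys/resource.h>
int main()
{
    const int N = 1 << 28;
    int* bigarray = new int[N];   // the allocation alone need not commit the pages
    rusage before {}, after {};
    getrusage(RUSAGE_SELF, &before);
    for (int k = 0; k < N; ++k)
        bigarray[k] = k;          // the first write into each page triggers a minor fault
    getrusage(RUSAGE_SELF, &after);
    std::printf("minor page faults during the loop: %ld\n",
                after.ru_minflt - before.ru_minflt);
    delete [] bigarray;
    return 0;
}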
Another question is the branch in push_back. It is claimed in many answers that this is the main reason why push_back is much slower than operator[]. Indeed, comparing the raw pointer without the memset to reserve + push_back, the former is twice as fast.
Similarly, if we add a bit of UB (but check the results later):
int routine3_modified()
{
int sum;
vector<int> bigarray;
bigarray.reserve (N);
memset(bigarray.data(), 0, sizeof(int)*N); // technically, it's UB
PROFILE (
{
for (unsigned int k = 0; k < N; ++k)
bigarray.push_back (k);
}, "reserve + push_back");
sum = std::accumulate (begin (bigarray), end (bigarray), 0);
return sum;
}
This modified version is about 2x slower than new + full memset. So whatever the push_back call does, it leads to a factor-2 slowdown compared to just setting the elements (via operator[] in both the vector and the raw-array case).
But is it the branch that push_back needs, or the additional operation?
// pseudo-code
void push_back(T const& p)
{
if(size() == capacity())
{
resize( size() < 10 ? 10 : size()*2 );
}
(*this)[size()] = p; // actually using the allocator
++m_end;
}
It really is that simple, see for example libstdc++'s implementation.
I tested this by taking the vector<int> bigarray(N); + operator[] variant and inserting a function call that mimics the behavior of push_back:
unsigned x = 0;
void silly_branch(int k)
{
if(k == x)
{
x = x < 10 ? 10 : x*2;
}
}
int routine2_modified()
{
int sum;
vector<int> bigarray (N);
PROFILE (
{
for (unsigned int k = 0; k < N; ++k)
{
silly_branch(k);
bigarray[k] = k;
}
}, "vector");
sum = std::accumulate (begin (bigarray), end (bigarray), 0);
return sum;
}
Even when declaring x as volatile, this has only a 1% influence on the measurement. Of course, you have to verify that the branch actually ends up in the generated opcode, but my assembler knowledge does not allow me to verify that (at -O3).
The interesting point now is what happens when I add an increment to silly_branch:
unsigned x = 0;
void silly_branch(int k)
{
if(k == x)
{
x = x < 10 ? 10 : x*2;
}
++x;
}
Now the modified routine2_modified runs 2x slower than the original routine2, on par with the routine3_modified proposed above, including the UB of committing the memory pages. I do not find this particularly surprising, since it adds another write to every write in the loop, so we have twice the work and twice the duration.
Conclusion
Well, you would have to look closely at the assembly and use profiling tools to verify that the memory-management hypothesis and the additional-write hypothesis are good ("correct") ones. But I think the hints are strong enough to claim that something more complicated is going on than just a branch that makes push_back slower.
Here is the full test code:
#include <iostream>
#include <iomanip>
#include <vector>
#include <numeric>
#include <chrono>
#include <string>
#include <cstring>
#define PROFILE(BLOCK, ROUTNAME) ProfilerRun([&](){do {BLOCK;} while(0);}, \
ROUTNAME, __FILE__, __LINE__);
//#define PROFILE(BLOCK, ROUTNAME) BLOCK
template <typename T>
void ProfilerRun (T&& func, const std::string& routine_name = "unknown",
const char* file = "unknown", unsigned line = 0)
{
using std::chrono::duration_cast;
using std::chrono::microseconds;
using std::chrono::steady_clock;
using std::cerr;
using std::endl;
steady_clock::time_point t_begin = steady_clock::now();
// Call the function
func();
steady_clock::time_point t_end = steady_clock::now();
cerr << "[" << std::setw (20)
<< (std::strrchr (file, '/') ?
std::strrchr (file, '/') + 1 : file)
<< ":" << std::setw (5) << line << "] "
<< std::setw (10) << std::setprecision (6) << std::fixed
<< static_cast<float> (duration_cast<microseconds>
(t_end - t_begin).count()) / 1e6
<< "s --> " << routine_name << endl;
cerr.unsetf (std::ios_base::floatfield);
}
using namespace std;
constexpr int N = (1 << 28);
constexpr int PAGESIZE = 4096;
uint64_t __attribute__((noinline)) routine1()
{
uint64_t sum;
int* bigarray = new int[N];
PROFILE (
{
for (int k = 0, *p = bigarray; p != bigarray+N; ++p, ++k)
*p = k;
}, "new (routine1)");
sum = std::accumulate (bigarray, bigarray + N, 0ULL);
delete [] bigarray;
return sum;
}
uint64_t __attribute__((noinline)) routine2()
{
uint64_t sum;
int* bigarray = new int[N];
memset(bigarray, 0, sizeof(int)*N);
PROFILE (
{
for (int k = 0, *p = bigarray; p != bigarray+N; ++p, ++k)
*p = k;
}, "new + full memset (routine2)");
sum = std::accumulate (bigarray, bigarray + N, 0ULL);
delete [] bigarray;
return sum;
}
uint64_t __attribute__((noinline)) routine3()
{
uint64_t sum;
int* bigarray = new int[N];
for(int k = 0; k < N; k += PAGESIZE/2/sizeof(int))
bigarray[k] = 0;
PROFILE (
{
for (int k = 0, *p = bigarray; p != bigarray+N; ++p, ++k)
*p = k;
}, "new + strided memset (every page half) (routine3)");
sum = std::accumulate (bigarray, bigarray + N, 0ULL);
delete [] bigarray;
return sum;
}
uint64_t __attribute__((noinline)) routine4()
{
uint64_t sum;
int* bigarray = new int[N];
for(int k = 0; k < N; k += PAGESIZE/1/sizeof(int))
bigarray[k] = 0;
PROFILE (
{
for (int k = 0, *p = bigarray; p != bigarray+N; ++p, ++k)
*p = k;
}, "new + strided memset (every page) (routine4)");
sum = std::accumulate (bigarray, bigarray + N, 0ULL);
delete [] bigarray;
return sum;
}
uint64_t __attribute__((noinline)) routine5()
{
uint64_t sum;
int* bigarray = new int[N];
for(int k = 0; k < N; k += PAGESIZE*2/sizeof(int))
bigarray[k] = 0;
PROFILE (
{
for (int k = 0, *p = bigarray; p != bigarray+N; ++p, ++k)
*p = k;
}, "new + strided memset (every other page) (routine5)");
sum = std::accumulate (bigarray, bigarray + N, 0ULL);
delete [] bigarray;
return sum;
}
uint64_t __attribute__((noinline)) routine6()
{
uint64_t sum;
int* bigarray = new int[N];
for(int k = 0; k < N; k += PAGESIZE*4/sizeof(int))
bigarray[k] = 0;
PROFILE (
{
for (int k = 0, *p = bigarray; p != bigarray+N; ++p, ++k)
*p = k;
}, "new + strided memset (every 4th page) (routine6)");
sum = std::accumulate (bigarray, bigarray + N, 0ULL);
delete [] bigarray;
return sum;
}
uint64_t __attribute__((noinline)) routine7()
{
uint64_t sum;
vector<int> bigarray (N);
PROFILE (
{
for (int k = 0; k < N; ++k)
bigarray[k] = k;
}, "vector, using ctor to initialize (routine7)");
sum = std::accumulate (begin (bigarray), end (bigarray), 0ULL);
return sum;
}
uint64_t __attribute__((noinline)) routine8()
{
uint64_t sum;
vector<int> bigarray;
PROFILE (
{
for (int k = 0; k < N; ++k)
bigarray.push_back (k);
}, "vector (+ no reserve) + push_back (routine8)");
sum = std::accumulate (begin (bigarray), end (bigarray), 0ULL);
return sum;
}
uint64_t __attribute__((noinline)) routine9()
{
uint64_t sum;
vector<int> bigarray;
bigarray.reserve (N);
PROFILE (
{
for (int k = 0; k < N; ++k)
bigarray.push_back (k);
}, "vector + reserve + push_back (routine9)");
sum = std::accumulate (begin (bigarray), end (bigarray), 0ULL);
return sum;
}
uint64_t __attribute__((noinline)) routine10()
{
uint64_t sum;
vector<int> bigarray;
bigarray.reserve (N);
memset(bigarray.data(), 0, sizeof(int)*N);
PROFILE (
{
for (int k = 0; k < N; ++k)
bigarray.push_back (k);
}, "vector + reserve + memset (UB) + push_back (routine10)");
sum = std::accumulate (begin (bigarray), end (bigarray), 0ULL);
return sum;
}
template<class T>
void __attribute__((noinline)) adjust_size(std::vector<T>& v, int k, double factor)
{
if(k >= v.size())
{
v.resize(v.size() < 10 ? 10 : k*factor);
}
}
uint64_t __attribute__((noinline)) routine11()
{
uint64_t sum;
vector<int> bigarray;
PROFILE (
{
for (int k = 0; k < N; ++k)
{
adjust_size(bigarray, k, 1.5);
bigarray[k] = k;
}
}, "vector + custom emplace_back @ factor 1.5 (routine11)");
sum = std::accumulate (begin (bigarray), end (bigarray), 0ULL);
return sum;
}
uint64_t __attribute__((noinline)) routine12()
{
uint64_t sum;
vector<int> bigarray;
PROFILE (
{
for (int k = 0; k < N; ++k)
{
adjust_size(bigarray, k, 2);
bigarray[k] = k;
}
}, "vector + custom emplace_back @ factor 2 (routine12)");
sum = std::accumulate (begin (bigarray), end (bigarray), 0ULL);
return sum;
}
uint64_t __attribute__((noinline)) routine13()
{
uint64_t sum;
vector<int> bigarray;
PROFILE (
{
for (int k = 0; k < N; ++k)
{
adjust_size(bigarray, k, 3);
bigarray[k] = k;
}
}, "vector + custom emplace_back @ factor 3 (routine13)");
sum = std::accumulate (begin (bigarray), end (bigarray), 0ULL);
return sum;
}
uint64_t __attribute__((noinline)) routine14()
{
uint64_t sum;
vector<int> bigarray;
PROFILE (
{
for (int k = 0; k < N; ++k)
bigarray.emplace_back (k);
}, "vector (+ no reserve) + emplace_back (routine14)");
sum = std::accumulate (begin (bigarray), end (bigarray), 0ULL);
return sum;
}
uint64_t __attribute__((noinline)) routine15()
{
uint64_t sum;
vector<int> bigarray;
bigarray.reserve (N);
PROFILE (
{
for (int k = 0; k < N; ++k)
bigarray.emplace_back (k);
}, "vector + reserve + emplace_back (routine15)");
sum = std::accumulate (begin (bigarray), end (bigarray), 0ULL);
return sum;
}
uint64_t __attribute__((noinline)) routine16()
{
uint64_t sum;
vector<int> bigarray;
bigarray.reserve (N);
memset(bigarray.data(), 0, sizeof(bigarray[0])*N);
PROFILE (
{
for (int k = 0; k < N; ++k)
bigarray.emplace_back (k);
}, "vector + reserve + memset (UB) + emplace_back (routine16)");
sum = std::accumulate (begin (bigarray), end (bigarray), 0ULL);
return sum;
}
unsigned x = 0;
template<class T>
void /*__attribute__((noinline))*/ silly_branch(std::vector<T>& v, int k)
{
if(k == x)
{
x = x < 10 ? 10 : x*2;
}
//++x;
}
uint64_t __attribute__((noinline)) routine17()
{
uint64_t sum;
vector<int> bigarray(N);
PROFILE (
{
for (int k = 0; k < N; ++k)
{
silly_branch(bigarray, k);
bigarray[k] = k;
}
}, "vector, using ctor to initialize + silly branch (routine17)");
sum = std::accumulate (begin (bigarray), end (bigarray), 0ULL);
return sum;
}
template<class T, int N>
constexpr int get_extent(T(&)[N])
{ return N; }
int main()
{
uint64_t results[] = {routine2(),
routine1(),
routine2(),
routine3(),
routine4(),
routine5(),
routine6(),
routine7(),
routine8(),
routine9(),
routine10(),
routine11(),
routine12(),
routine13(),
routine14(),
routine15(),
routine16(),
routine17()};
std::cout << std::boolalpha;
for(int i = 1; i < get_extent(results); ++i)
{
std::cout << i << ": " << (results[0] == results[i]) << "\n";
}
std::cout << x << "\n";
}
A sample run on an old & slow computer; note: N == (1 << 28) here, not (1 << 29) as in the OP.
-std=c++11 -O3 -march=native
[ temp.cpp:    71] 0.654927s --> new + full memset (routine2)
[ temp.cpp:    54] 1.042405s --> new (routine1)
[ temp.cpp:    71] 0.605061s --> new + full memset (routine2)
[ temp.cpp:    89] 0.597487s --> new + strided memset (every page half) (routine3)
[ temp.cpp:   107] 0.601271s --> new + strided memset (every page) (routine4)
[ temp.cpp:   125] 0.783610s --> new + strided memset (every other page) (routine5)
[ temp.cpp:   143] 0.903038s --> new + strided memset (every 4th page) (routine6)
[ temp.cpp:   157] 0.602401s --> vector, using ctor to initialize (routine7)
[ temp.cpp:   170] 3.811291s --> vector (+ no reserve) + push_back (routine8)
[ temp.cpp:   184] 2.091391s --> vector + reserve + push_back (routine9)
[ temp.cpp:   199] 1.375837s --> vector + reserve + memset (UB) + push_back (routine10)
[ temp.cpp:   224] 8.738293s --> vector + custom emplace_back @ factor 1.5 (routine11)
[ temp.cpp:   240] 5.513803s --> vector + custom emplace_back @ factor 2 (routine12)
[ temp.cpp:   256] 5.150388s --> vector + custom emplace_back @ factor 3 (routine13)
[ temp.cpp:   269] 3.789820s --> vector (+ no reserve) + emplace_back (routine14)
[ temp.cpp:   283] 2.090259s --> vector + reserve + emplace_back (routine15)
[ temp.cpp:   298] 1.288740s --> vector + reserve + memset (UB) + emplace_back (routine16)
[ temp.cpp:   325] 0.611168s --> vector, using ctor to initialize + silly branch (routine17)
1: true
2: true
3: true
4: true
5: true
6: true
7: true
8: true
9: true
10: true
11: true
12: true
13: true
14: true
15: true
16: true
17: true
335544320
Answer 2 (score: 8)
When allocating the array in the constructor, the compiler/library can basically memset() the initial fill and then just set each individual value. When you use push_back(), the std::vector<T> class needs to:
1. Check that there is enough space.
2. Change the end pointer / size to the new position.
3. Set the actual value.
The last step is the only thing that needs to be done at all when the memory is allocated in one go.
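A rough sketch of that constructor path for a trivial element type like int (illustrative only, not the actual library source): one allocation plus one bulk zero-fill, with no per-element bookkeeping.
#include <cstring>
#include <new>
int* make_zero_filled(std::size_t n)
{
    int* data = static_cast<int*>(::operator new(n * sizeof(int)));
    std::memset(data, 0, n * sizeof(int));   // one bulk fill instead of n capacity checks
    return data;                             // the caller releases it with ::operator delete
}
int main()
{
    int* p = make_zero_filled(std::size_t(1) << 20);
    ::operator delete(p);
    return 0;
}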
Answer 3 (score: 3)
I can answer your second question. Even though the vector is pre-allocated, push_back still has to check for available space on every call. operator[], on the other hand, performs no check at all and simply assumes the space is available.
Answer 4 (score: 1)
This is an extended comment rather than an answer, intended to help improve the question.
Routine 4 invokes undefined behavior: you are writing past the end (the size) of the array. Replace reserve with resize to eliminate that.
Routines 3 to 5 may do nothing after optimization, because they have no observable output.
insert( vec.end(), src.begin(), src.end() ), where src is a random-access generator range (boost probably has one), might emulate the new version if insert is smart.
The duplication of routine1 looks interesting - does it change the timings?
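For what it is worth, here is a hypothetical version of that suggestion using boost::counting_iterator (the iterator type does exist in Boost.Iterator; this particular benchmark variant is only a sketch of the idea). Because the range is random access, insert can compute the element count up front, allocate once, and fill without a per-element capacity check:
#include <boost/iterator/counting_iterator.hpp>
#include <cstddef>
#include <vector>
int main()
{
    const int N = 1 << 28;
    std::vector<int> bigarray;
    // one range insert that produces 0, 1, ..., N-1
    bigarray.insert(bigarray.end(),
                    boost::counting_iterator<int>(0),
                    boost::counting_iterator<int>(N));
    return int(bigarray.size() == std::size_t(N));
}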