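A simple performance test of hash maps suggests that the C++ version is slower than both the Perl version and the Go version.
My machine has a Core(TM) i7-2670QM CPU @ 2.20GHz and runs Ubuntu 14.04.3 LTS.
Any ideas?
Perl version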
use Time::HiRes qw( usleep ualarm gettimeofday tv_interval nanosleep
clock_gettime clock_getres clock_nanosleep clock
stat );
sub getTS {
my ($seconds, $microseconds) = gettimeofday;
return $seconds + (0.0+ $microseconds)/1000000.0;
}
my %mymap;
$mymap{"U.S."} = "Washington";
$mymap{"U.K."} = "London";
$mymap{"France"} = "Paris";
$mymap{"Russia"} = "Moscow";
$mymap{"China"} = "Beijing";
$mymap{"Germany"} = "Berlin";
$mymap{"Japan"} = "Tokyo";
$mymap{"China"} = "Beijing";
$mymap{"Italy"} = "Rome";
$mymap{"Spain"} = "Madrad";
$x = "";
$start = getTS();
for ($i=0; $i<1000000; $i++) {
$x = $mymap{"China"};
}
printf "took %f sec\n", getTS() - $start;
C++ version
#include <cstdio>   // printf
#include <iostream>
#include <string>
#include <unordered_map>
#include <sys/time.h>
double getTS() {
struct timeval tv;
gettimeofday(&tv, NULL);
return tv.tv_sec + tv.tv_usec/1000000.0;
}
using namespace std;
int main () {
std::unordered_map<std::string,std::string> mymap;
// populating container:
mymap["U.S."] = "Washington";
mymap["U.K."] = "London";
mymap["France"] = "Paris";
mymap["Russia"] = "Moscow";
mymap["China"] = "Beijing";
mymap["Germany"] = "Berlin";
mymap["Japan"] = "Tokyo";
mymap["China"] = "Beijing";
mymap["Italy"] = "Rome";
mymap["Spain"] = "Madrad";
double start = getTS();
string x;
for (int i=0; i<1000000; i++) {
mymap["China"];
}
printf("took %f sec\n", getTS() - start);
return 0;
}
Go version
package main
import "fmt"
import "time"
func main() {
var x string
mymap := make(map[string]string)
mymap["U.S."] = "Washington";
mymap["U.K."] = "London";
mymap["France"] = "Paris";
mymap["Russia"] = "Moscow";
mymap["China"] = "Beijing";
mymap["Germany"] = "Berlin";
mymap["Japan"] = "Tokyo";
mymap["China"] = "Beijing";
mymap["Italy"] = "Rome";
mymap["Spain"] = "Madrad";
t0 := time.Now()
sum := 1
for sum < 1000000 {
x = mymap["China"]
sum += 1
}
t1 := time.Now()
fmt.Printf("The call took %v to run.\n", t1.Sub(t0))
fmt.Println(x)
}
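Update 1
To improve the C++ version, I changed x = mymap["China"]; to mymap["China"];, but the performance difference is very small.
Update 2
I got the original result when compiling without any optimization: g++ -std=c++11 unorderedMap.cc. With "-O2" optimization, it takes only about half the time (150 ms).
Update 3
To remove the possible char* to string constructor call, I created a string constant. The time drops to about 220 ms (compiled without optimization). Thanks to the suggestion from @neil-kirk, with optimization (-O2 flag) the time is about 80 ms.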
double start = getTS();
string x = "China";
for (int i=0; i<1000000; i++) {
mymap[x];
}
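Update 4
Thanks to @steffen-ullrich, who pointed out that the Perl version had a syntax error. I changed it. The performance number is about 150 ms.
Update 5
The number of instructions executed seems to matter. Using the command valgrind --tool=cachegrind <cmd>:
Go version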
$ valgrind --tool=cachegrind ./te1
==2103== Cachegrind, a cache and branch-prediction profiler
==2103== Copyright (C) 2002-2013, and GNU GPL'd, by Nicholas Nethercote et al.
==2103== Using Valgrind-3.10.0.SVN and LibVEX; rerun with -h for copyright info
==2103== Command: ./te1
==2103==
--2103-- warning: L3 cache found, using its data for the LL simulation.
The call took 1.647099s to run.
Beijing
==2103==
==2103== I refs: 255,763,381
==2103== I1 misses: 3,709
==2103== LLi misses: 2,743
==2103== I1 miss rate: 0.00%
==2103== LLi miss rate: 0.00%
==2103==
==2103== D refs: 109,437,132 (77,838,331 rd + 31,598,801 wr)
==2103== D1 misses: 352,474 ( 254,714 rd + 97,760 wr)
==2103== LLd misses: 149,260 ( 96,250 rd + 53,010 wr)
==2103== D1 miss rate: 0.3% ( 0.3% + 0.3% )
==2103== LLd miss rate: 0.1% ( 0.1% + 0.1% )
==2103==
==2103== LL refs: 356,183 ( 258,423 rd + 97,760 wr)
==2103== LL misses: 152,003 ( 98,993 rd + 53,010 wr)
==2103== LL miss rate: 0.0% ( 0.0% + 0.1% )
For the C++ unoptimized version (no optimization flag)
$ valgrind --tool=cachegrind ./a.out
==2180== Cachegrind, a cache and branch-prediction profiler
==2180== Copyright (C) 2002-2013, and GNU GPL'd, by Nicholas Nethercote et al.
==2180== Using Valgrind-3.10.0.SVN and LibVEX; rerun with -h for copyright info
==2180== Command: ./a.out
==2180==
--2180-- warning: L3 cache found, using its data for the LL simulation.
took 64.657681 sec
==2180==
==2180== I refs: 5,281,474,482
==2180== I1 misses: 1,710
==2180== LLi misses: 1,651
==2180== I1 miss rate: 0.00%
==2180== LLi miss rate: 0.00%
==2180==
==2180== D refs: 3,170,495,683 (1,840,363,429 rd + 1,330,132,254 wr)
==2180== D1 misses: 12,055 ( 10,374 rd + 1,681 wr)
==2180== LLd misses: 7,383 ( 6,132 rd + 1,251 wr)
==2180== D1 miss rate: 0.0% ( 0.0% + 0.0% )
==2180== LLd miss rate: 0.0% ( 0.0% + 0.0% )
==2180==
==2180== LL refs: 13,765 ( 12,084 rd + 1,681 wr)
==2180== LL misses: 9,034 ( 7,783 rd + 1,251 wr)
==2180== LL miss rate: 0.0% ( 0.0% + 0.0% )
For the C++ optimized version
$ valgrind --tool=cachegrind ./a.out
==2157== Cachegrind, a cache and branch-prediction profiler
==2157== Copyright (C) 2002-2013, and GNU GPL'd, by Nicholas Nethercote et al.
==2157== Using Valgrind-3.10.0.SVN and LibVEX; rerun with -h for copyright info
==2157== Command: ./a.out
==2157==
--2157-- warning: L3 cache found, using its data for the LL simulation.
took 9.419447 sec
==2157==
==2157== I refs: 1,451,459,660
==2157== I1 misses: 1,599
==2157== LLi misses: 1,549
==2157== I1 miss rate: 0.00%
==2157== LLi miss rate: 0.00%
==2157==
==2157== D refs: 430,486,197 (340,358,108 rd + 90,128,089 wr)
==2157== D1 misses: 12,008 ( 10,337 rd + 1,671 wr)
==2157== LLd misses: 7,372 ( 6,120 rd + 1,252 wr)
==2157== D1 miss rate: 0.0% ( 0.0% + 0.0% )
==2157== LLd miss rate: 0.0% ( 0.0% + 0.0% )
==2157==
==2157== LL refs: 13,607 ( 11,936 rd + 1,671 wr)
==2157== LL misses: 8,921 ( 7,669 rd + 1,252 wr)
==2157== LL miss rate: 0.0% ( 0.0% + 0.0% )
Answer 0 (score: 15)
From your Perl code (before your attempts to fix it):
@mymap = (); $mymap["U.S."] = "Washington"; $mymap["U.K."] = "London";
This is not a map but an array. The syntax for a hash map is:
%mymap;
$mymap{"U.S."} = ....
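So what you are actually doing is creating an array, not a hash map, and accessing element 0 every time.
Please always use use strict; and use warnings; with Perl. Even a simple syntax check with warnings would have shown you the problem: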
perl -cw x.pl
Argument "U.S." isn't numeric in array element at x.pl line 9.
Argument "U.K." isn't numeric in array element at x.pl line 10.
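Apart from that, the main part of your benchmark does nothing useful (it assigns a variable and never uses it), and some compilers can detect this and simply optimize it away.
If you look at the code Perl generates for your program, you will see: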
$ perl -MO=Deparse x.pl
@mymap = ();
$mymap[0] = 'Washington';
$mymap[0] = 'London';
...
for ($i = 0; $i < 1000000; ++$i) {
$x = $mymap[0];
}
That is, it detects the problem at compile time and replaces it with an access to array index 0.
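So whenever you benchmark, you need to check that the program actually does what it is supposed to do, and that the compiler does not optimize the measured work away.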
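Also, using a simple timer is not a realistic benchmark. There are other processes on the system, there is the scheduler, there is cache thrashing... And with today's CPUs a lot depends on the load of the system, because the CPU may run one benchmark in a lower power mode than another, i.e. with a different CPU clock. For example, multiple runs of the same "benchmark" on my system measured anywhere between 100 ms and 150 ms.
Benchmarks are lies, and micro-benchmarks like yours even more so.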
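To make the two points above concrete (keep the result observable, and do not trust a single timing), here is a minimal C++ sketch. The helper names time_lookups_ms and median_lookup_ms are made up for this illustration and appear nowhere in the question. It assumes that writing the hit count to a volatile is enough to keep the loop from being dropped as dead code. An optimizer may still transform the loop, so treat the numbers as rough.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

using Map = std::unordered_map<std::string, std::string>;

// Runs `lookups` finds of `key` and returns the elapsed time in milliseconds.
// The hit count is stored into a volatile afterwards, so the loop's result is
// observable and the loop cannot be discarded as unused.
double time_lookups_ms(const Map& m, const std::string& key, int lookups) {
    std::size_t hits = 0;
    auto const start = std::chrono::steady_clock::now();
    for (int i = 0; i < lookups; ++i) {
        if (m.find(key) != m.end()) ++hits;
    }
    auto const elapsed = std::chrono::steady_clock::now() - start;
    volatile std::size_t sink = hits;
    (void)sink;
    return std::chrono::duration<double, std::milli>(elapsed).count();
}

// Repeats the measurement `runs` times (assumed >= 1) and returns the median,
// which is less sensitive to scheduler noise and clock changes than one run.
double median_lookup_ms(const Map& m, const std::string& key, int runs, int lookups) {
    std::vector<double> samples;
    for (int r = 0; r < runs; ++r) {
        samples.push_back(time_lookups_ms(m, key, lookups));
    }
    std::sort(samples.begin(), samples.end());
    return samples[samples.size() / 2];
}
```

Calling median_lookup_ms(mymap, "China", 5, 1000000) on the map from the question would give one median figure over five runs. It is still a micro-benchmark, only a slightly less noisy one.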
Answer 1 (score: 4)
I have modified your example to get some details about the structure of the hash table:
#include <cstdio>   // printf
#include <iostream>
#include <string>
#include <unordered_map>
#include <sys/time.h>
#include <chrono>
using namespace std;
int main()
{
std::unordered_map<std::string, std::string> mymap;
// populating container:
mymap["U.S."] = "Washington";
mymap["U.K."] = "London";
mymap["France"] = "Paris";
mymap["Russia"] = "Moscow";
mymap["China"] = "Beijing";
mymap["Germany"] = "Berlin";
mymap["Japan"] = "Tokyo";
mymap["China"] = "Beijing";
mymap["Italy"] = "Rome";
mymap["Spain"] = "Madrad";
std::hash<std::string> h;
for ( auto const& i : mymap )
{
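// %u expects an unsigned int, so the size_t hash is narrowed explicitly;
// the trailing 'd' in the format is printed literally.
printf( "hash(%s) = %ud\n", i.first.c_str(), static_cast<unsigned>( h( i.first ) ) );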
}
for ( std::size_t i = 0; i != mymap.bucket_count(); ++i )
{
auto const bucketsize = mymap.bucket_size( i );
if ( 0 != bucketsize )
{
printf( "size of bucket %zu = %zu\n", i, bucketsize );
}
}
auto const start = std::chrono::steady_clock::now();
string const x = "China";
std::string res;
for ( int i = 0; i < 1000000; i++ )
{
mymap.find( x );
}
auto const elapsed = std::chrono::steady_clock::now() - start;
printf( "%s\n", res.c_str() );  // passing a std::string to %s is undefined behavior
printf( "took %lld ms\n",
static_cast<long long>( std::chrono::duration_cast<std::chrono::milliseconds>( elapsed ).count() ) );
return 0;
}
Running it on my system, I get a runtime of about 68 ms, with the following output:
hash(Japan) = 3611029618d
hash(Spain) = 749986602d
hash(China) = 3154384700d
hash(U.S.) = 2546447179d
hash(Italy) = 2246786301d
hash(Germany) = 2319993784d
hash(U.K.) = 2699630607d
hash(France) = 3266727934d
hash(Russia) = 3992029278d
size of bucket 0 = 0
size of bucket 1 = 0
size of bucket 2 = 1
size of bucket 3 = 1
size of bucket 4 = 1
size of bucket 5 = 0
size of bucket 6 = 1
size of bucket 7 = 0
size of bucket 8 = 0
size of bucket 9 = 2
size of bucket 10 = 3
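It turns out that the hash table is not well optimized and contains some collisions. Printing the elements of each bucket shows that Spain and China are both in bucket 9. A bucket is probably a linked list whose nodes are scattered in memory, which explains the performance drop.
If you choose another hash table size so that there are no collisions, you can get a speedup. I tested this by adding mymap.rehash(1001) and got a speedup of 20-30%, to between 44 and 52 ms.
Now, another point is the computation of the hash value for "China". This function is executed in every iteration. We can make this go away when we switch to constant plain C strings: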
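As a rough illustration of the rehash experiment, the sketch below counts how many elements share a bucket, so the number can be compared before and after mymap.rehash(1001). The helper names colliding_elements and make_capitals are hypothetical. Whether collisions actually disappear depends on the standard library's hash function and bucket policy, so this is a way to check, not a guarantee.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

using Map = std::unordered_map<std::string, std::string>;

// Counts the elements that sit in a bucket together with at least one other element.
std::size_t colliding_elements(const Map& m) {
    std::size_t n = 0;
    for (std::size_t b = 0; b < m.bucket_count(); ++b) {
        if (m.bucket_size(b) > 1) n += m.bucket_size(b);
    }
    return n;
}

// Builds the nine-entry map used in the question.
Map make_capitals() {
    return Map{
        {"U.S.", "Washington"}, {"U.K.", "London"},  {"France", "Paris"},
        {"Russia", "Moscow"},   {"China", "Beijing"}, {"Germany", "Berlin"},
        {"Japan", "Tokyo"},     {"Italy", "Rome"},    {"Spain", "Madrid"},
    };
}
```

A typical use would be to call colliding_elements(m), then m.rehash(1001), then colliding_elements(m) again. rehash(n) guarantees at least n buckets, which makes collisions less likely for nine keys but does not rule them out.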
#include <cstdio>   // printf
#include <iostream>
#include <string>
#include <unordered_map>
#include <sys/time.h>
#include <chrono>
static auto constexpr us = "U.S.";
static auto constexpr uk = "U.K.";
static auto constexpr fr = "France";
static auto constexpr ru = "Russia";
static auto constexpr cn = "China";
static auto constexpr ge = "Germany";
static auto constexpr jp = "Japan";
static auto constexpr it = "Italy";
static auto constexpr sp = "Spain";
using namespace std;
int main()
{
std::unordered_map<const char*, std::string> mymap;
// populating container:
mymap[us] = "Washington";
mymap[uk] = "London";
mymap[fr] = "Paris";
mymap[ru] = "Moscow";
mymap[cn] = "Beijing";
mymap[ge] = "Berlin";
mymap[jp] = "Tokyo";
mymap[it] = "Rome";
mymap[sp] = "Madrad";
string const x = "China";
char const* res = nullptr;
auto const start = std::chrono::steady_clock::now();
for ( int i = 0; i < 1000000; i++ )
{
res = mymap[cn].c_str();
}
auto const elapsed = std::chrono::steady_clock::now() - start;
printf( "%s\n", res );
printf( "took %lld ms\n",
static_cast<long long>( std::chrono::duration_cast<std::chrono::milliseconds>( elapsed ).count() ) );
return 0;
}
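On my machine, this cuts the runtime by 50%, to 20 ms. The difference is that instead of computing the hash value from the string contents, it now just converts the address into a hash value, which is much faster because it simply returns the address value as a size_t. We also do not need to rehash, because there are no collisions in the bucket for cn.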
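One caveat with this trick, shown as a small sketch (found_by_pointer is a hypothetical helper): with const char* keys the default hash and equality work on the pointer value, not on the characters. A lookup therefore only succeeds when you pass the very same pointer that was used for insertion.

```cpp
#include <string>
#include <unordered_map>

// Returns true if `key` is present in a map keyed by raw pointers.
// std::hash<const char*> and std::equal_to<const char*> compare addresses,
// so two different buffers holding "China" count as two different keys.
bool found_by_pointer(const std::unordered_map<const char*, std::string>& m,
                      const char* key) {
    return m.find(key) != m.end();
}
```

For example, after m[cn] = "Beijing", found_by_pointer(m, cn) is true, while found_by_pointer(m, s.c_str()) for a std::string s = "China" is false, because s owns a separate buffer. So the pointer-keyed map is faster, but it no longer answers the same question as the string-keyed one.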
Answer 2 (score: 3)
This just shows that for this particular use case, the Go hash map implementation is very well optimized.
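The lookup calls mapaccess1_faststr, which is specifically optimized for string keys. In particular, for small single-bucket maps the hash code is not even computed for short (less than 32 bytes) strings.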
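Answer 3 (score: 2)
This is a guess:
unordered_map::operator[] takes a string argument. You are supplying a char*. Without optimization, the C++ version probably calls the std::string(char*) constructor a million times to turn "China" into a std::string. Go's language specification probably makes "string literals" the same type as string, so no constructor call is needed.
With optimization, the string constructor will be hoisted out of the loop and you won't see the same problem. Or quite possibly no code will be generated at all, apart from the two system calls to get the time and the system call to print the difference, because in the end none of it has any effect.
To confirm this, you would have to actually look at the assembly being generated. That will be a compiler option. See this question for the flags required by GCC.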
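To illustrate the guess, here are the two loop shapes side by side. The function names are made up for this sketch. It uses find instead of operator[], so the map can be passed as const. In the first function, every iteration has to convert the literal into a temporary std::string unless the optimizer hoists that. In the second, the key is built once before the loop.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

using Map = std::unordered_map<std::string, std::string>;

// The literal "China" is a const char[6]; find() takes a const std::string&,
// so a temporary std::string is constructed for each call.
std::size_t hits_with_literal(const Map& m, int lookups) {
    std::size_t hits = 0;
    for (int i = 0; i < lookups; ++i) {
        if (m.find("China") != m.end()) ++hits;
    }
    return hits;
}

// The key is constructed once, outside the loop, and reused for every lookup.
std::size_t hits_with_prebuilt_key(const Map& m, int lookups) {
    const std::string key = "China";
    std::size_t hits = 0;
    for (int i = 0; i < lookups; ++i) {
        if (m.find(key) != m.end()) ++hits;
    }
    return hits;
}
```

Both return the same count. Only the per-iteration work differs, which is what you would time in an unoptimized build.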