C optimization - low-level code

Time: 2013-09-05 15:04:53

Tags: c++ c optimization memory-management

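I am trying to write a memory allocator comparable to dlmalloc, the malloc used in glibc. dlmalloc is a best-fit allocator with block splitting, and it keeps a pool of recently used blocks before consolidating them into large blocks again. The allocator I am writing is first-fit instead.

My problem is twofold: (1) the test timings of my code are very irregular compared with those of glibc malloc, and (2) on some days the average run time of my code is 3 to 4 times higher. (2) is not a big deal, but I would like to understand why glibc malloc is not affected in the same way. Further down, this post shows a sample of the behavior described in (1) for both malloc and my code. Sometimes the average time over a batch of 1000 tests is much higher than malloc's (problem (2) above), and sometimes the averages are the same. However, the times within a batch of tests of my code are always very irregular (problem (1) above): within a batch, some runs jump to about 20 times the average, and these jumps are interspersed among otherwise regular (near-average) times. glibc malloc does not do this.

The code I am working on is as follows.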

===================================

/* represent an allocated/unallocated  block of memory */
struct Block {
    /* previous allocated or unallocated block needed for consolidation but not used in allocation */
    Block* prev;
    /* 1 if allocated and 0 if not */
    unsigned int tagh;
    /* previous unallocated block */
    Block* prev_free;
    /* next unallocated block */
    Block* next_free;
    /* size of current block */
    unsigned int size;
};

#define CACHE_SZ 120000000

/* array to be managed by allocator */
char arr[CACHE_SZ] __attribute__((aligned(4)));

/* initialize the contiguous memory located at arr for allocator */
void init_cache(){
/* setup list head node that does not change */
   Block* a = (Block*)  arr;
  a->prev = 0; 
  a->tagh = 1;
  a->prev_free = 0;
  a->size = 0;

/* setup the usable data block */
  Block* b = (Block*) (arr + sizeof(Block));
  b->prev = a; 
  b->tagh = 0;
  b->prev_free = a;
  b->size = CACHE_SZ - 3*sizeof(Block);
  a->next_free = b;

/* setup list tail node that does not change */
  Block* e = (Block*)((char*)arr + CACHE_SZ - sizeof(Block)); 
  e->prev = b;
  e->tagh = 1;
  e->prev_free = b;
  e->next_free = 0;
  e->size = 0;
  b->next_free = e;
}

char* alloc(unsigned int size){
  register Block* current = ((Block*) arr)->next_free; 
  register Block* new_block;

/* search for a first-fit block */

   while(current != 0){
       if( current->size >= size + sizeof(Block)) goto good;
       current = current->next_free;
   }

/* what to do if no decent size block found */
   if( current == 0) {
       return 0;
   }

/* good block found */
good:
/* if block size is exact return it */
   if( current->size == size){
       if(current->next_free != 0) current->next_free->prev_free = current->prev_free;
       if(current->prev_free != 0) current->prev_free->next_free = current->next_free;
       return (char* ) current + sizeof(Block);
   }

/* otherwise split the block */

   current->size -= size + sizeof(Block); 

    new_block = (Block*)( (char*)current + sizeof(Block) + current->size);
    new_block->size = size;
    new_block->prev = current;
    new_block->tagh = 1;
   ((Block*)((char*) new_block + sizeof(Block) + new_block->size ))->prev = new_block;

   return (char* ) new_block + sizeof(Block);
}

int main(int argc, char** argv){
    init_cache();
    int count = 0;

    /* the count considers the size of the cache arr */
    while(count < 4883){

        /* the following line tests malloc (needs #include <stdlib.h>); the quantity (1024*24) ensures word alignment */
        //char * volatile p = (char *) malloc(1024*24);
        /* the following line tests above code in exactly the same way */
        char * volatile p = alloc(1024*24);
        count++;

    }
    return 0;
}

=====================================
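The test in main passes 1024*24 so that every block header stays word-aligned. For arbitrary request sizes, one option would be to round the size up before searching the free list. The following is only a minimal sketch and is not part of the code above: the name `align_up` is hypothetical, and the alignment is assumed to be a nonzero power of two.

```c
#include <assert.h>
#include <stddef.h>

/* Round `size` up to the next multiple of `alignment`.
 * Assumption: `alignment` is a nonzero power of two (e.g. 4, 8, 16). */
size_t align_up(size_t size, size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    /* Adding alignment-1 and clearing the low bits rounds up. */
    return (size + alignment - 1) & ~(alignment - 1);
}
```

With such a helper, a call like alloc(align_up(n, 8)) would keep every header on an 8-byte boundary, provided arr itself starts on one.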

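I compile the allocator code above with just:

g++ -O9 alloc.c

and run a simple test that always splits blocks and never returns an exact-size block:

bash$ for((i=0; i<1000; i++)); do (time ./a.out) 2>&1 | grep real; done

Sample output from my code and from the glibc malloc test follows:

My code: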

real    0m0.023s
real    0m0.109s    <----- irregular jump >
real    0m0.024s
real    0m0.086s
real    0m0.022s
real    0m0.104s    <----- again irregular jump >
real    0m0.023s
real    0m0.023s
real    0m0.098s
real    0m0.023s
real    0m0.097s
real    0m0.024s
real    0m0.091s
real    0m0.023s
real    0m0.025s
real    0m0.088s
real    0m0.023s
real    0m0.086s
real    0m0.024s
real    0m0.024s

malloc code (nice and regular, staying close to 20 ms):

real    0m0.025s
real    0m0.024s
real    0m0.024s
real    0m0.026s
real    0m0.024s
real    0m0.026s
real    0m0.025s
real    0m0.026s
real    0m0.026s
real    0m0.025s
real    0m0.025s
real    0m0.024s
real    0m0.024s
real    0m0.024s
real    0m0.025s
real    0m0.026s
real    0m0.025s

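Note that the malloc timings are much more regular. At other, unpredictable times my code takes 0m0.070s instead of 0m0.020s, so the average run time is close to 70 ms instead of 25 ms (problem (2) above), but that is not shown here. In this case I was lucky that it ran close to malloc's average (25 ms).

The questions are: (1) how can I modify my code to get more regular times, like glibc malloc? (2) If possible, how can I make it faster than glibc malloc, since I have read that dlmalloc is a characteristically balanced allocator and not the fastest (considering only splitting/best-fit/first-fit allocators, not other kinds)?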

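2 Answers:

Answer 0: (score: 5)

Don't use 'real' time: try 'user' + 'sys', averaged over a large number of iterations. The problem is twofold: (a) your process is not the only one on the processor and gets interrupted depending on what other processes are doing, and (b) time measurement has a granularity. I'm not sure what it is today, but in the past it was just the size of a time slice => 1/100 second.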
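As a rough illustration of measuring user + sys instead of wall-clock time, the sketch below reads the process's CPU time with `getrusage` and averages it over many calls. It is not from the original post: `cpu_seconds`, `average_cpu_seconds`, and `sample_workload` are hypothetical names, and the workload is only a stand-in for the allocation loop under test.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/resource.h>
#include <sys/time.h>

/* Convert a struct timeval to seconds. */
static double tv_seconds(struct timeval tv) {
    return (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
}

/* User + system CPU time used by this process so far, or -1.0 on error. */
double cpu_seconds(void) {
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1.0;
    return tv_seconds(ru.ru_utime) + tv_seconds(ru.ru_stime);
}

/* Average CPU seconds per call of `fn` over `iterations` calls.
 * Returns -1.0 when `iterations` is not positive. */
double average_cpu_seconds(void (*fn)(void), int iterations) {
    assert(fn != NULL);
    if (iterations <= 0)
        return -1.0;
    double start = cpu_seconds();
    for (int i = 0; i < iterations; i++)
        fn();
    return (cpu_seconds() - start) / iterations;
}

/* Hypothetical stand-in workload; replace with the code being measured. */
void sample_workload(void) {
    volatile unsigned long sink = 0;
    for (unsigned long i = 0; i < 100000UL; i++)
        sink = sink + i;
}
```

Calling average_cpu_seconds(sample_workload, 1000) from main gives a per-call figure that excludes time spent waiting while other processes run, although it is still limited by the granularity of the kernel's CPU accounting.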

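Answer 1: (score: 5)

Yes, I compared the two solutions and ran them in several different ways. I don't know what the problem is, but my guess is that most of the time is spent in "creating a large contiguous slab of 1200000000 bytes". If I reduce the size and still perform the same number of allocations, the time goes down.

Another piece of evidence pointing to this is that system time is a large part of real time, while user time is almost zero.

Now, on my system, once I have run these things a few times under high memory load, the timing doesn't really fluctuate that much. This is most likely because, once I have swapped out a bunch of old junk that had accumulated in memory, the system has plenty of "spare" pages available for my process. When memory is more constrained (because I let the system go off and do other things, such as some database work on the "website" I'm experimenting with [it is a "sandbox" version of a real website, so it has real data in its database and can fill memory quickly, etc.]), I get more variation until I clean out memory again.

But I think the key to the "mystery" is that system time is the vast majority of the time used. It is also worth noting that when malloc is used with large blocks, the memory is not actually "really allocated". And when smaller blocks are allocated, malloc seems to be somehow smarter and faster than the "optimized" alloc, at least for larger amounts of memory. Don't ask me exactly how that works.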
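System time dominating is consistent with the kernel doing work, such as servicing page faults, the first time each page of the memory is touched. One way to probe this, as a rough sketch only: count minor page faults around a loop that writes one byte per page of a freshly allocated buffer. The function names below are hypothetical, the `ru_minflt` field is assumed to be available (it is on Linux and the BSDs), and the 4096-byte fallback page size is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <unistd.h>

/* Minor page faults taken by this process so far, or -1 on error. */
long minor_faults(void) {
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_minflt;
}

/* Write one byte per page of `buf` and return the number of minor
 * faults that occurred while doing so. */
long faults_from_touching(volatile char *buf, size_t len) {
    assert(buf != NULL);
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        page = 4096; /* assumed fallback page size */
    long before = minor_faults();
    for (size_t off = 0; off < len; off += (size_t)page)
        buf[off] = 1;
    return minor_faults() - before;
}

/* Allocate `len` bytes with malloc, touch every page once, and report
 * the faults seen. Returns -1 if the allocation fails. */
long faults_for_fresh_malloc(size_t len) {
    char *buf = (char *)malloc(len);
    if (buf == NULL)
        return -1;
    long faults = faults_from_touching(buf, len);
    free(buf);
    return faults;
}
```

Comparing the fault count for a large buffer with the sys time reported by `time` gives a hint as to whether first-touch cost, rather than the allocator's own bookkeeping, accounts for the difference.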

Here is some evidence:

I changed main in the code to:

/* these two includes belong at the top of the file */
#include <stdio.h>   /* printf, puts */
#include <stdlib.h>  /* malloc */

#define BLOCK_SIZE (CACHE_SZ / 5000)

int main(int argc, char** argv){
    init_cache();
    int count = 0;
    int failed = 0;
    size_t size = 0;

    /* the count considers the size of the cache arr */
    while(count < (int)((CACHE_SZ / BLOCK_SIZE) * 0.96)){
        char * volatile p;
        /* any command-line argument selects malloc; no argument selects alloc */
        if (argc > 1)
            p = (char *)malloc(BLOCK_SIZE);
        else
            p = alloc(BLOCK_SIZE);
        if (p == 0)
        {
            failed++;
            puts("p = NULL");
        }
        count++;
        size += BLOCK_SIZE;
    }
    printf("Count = %d, total=%zu, failed=%d\n", count, size, failed);
    return 0;
}

Then I varied CACHE_SZ and ran the program with or without an argument to select the alloc or the malloc variant.
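One convenient way to vary CACHE_SZ without editing the file is to guard the definition so it can be overridden on the compiler command line with `-DCACHE_SZ=...`. This is a sketch of that idea, not what the answer necessarily did; `planned_iterations` and `planned_total_bytes` are hypothetical helpers that reproduce the loop bound and byte total of the modified main.

```c
#include <assert.h>

/* Default cache size; override with e.g. -DCACHE_SZ=12000000. */
#ifndef CACHE_SZ
#define CACHE_SZ 120000000
#endif

#define BLOCK_SIZE (CACHE_SZ / 5000)

/* Number of allocations the modified main performs for this CACHE_SZ. */
int planned_iterations(void) {
    assert(BLOCK_SIZE > 0); /* CACHE_SZ must be at least 5000 */
    return (int)((CACHE_SZ / BLOCK_SIZE) * 0.96);
}

/* Total number of bytes requested across all iterations. */
unsigned long planned_total_bytes(void) {
    return (unsigned long)planned_iterations() * BLOCK_SIZE;
}
```

With the default of 120000000 this gives 4800 iterations and 115200000 bytes, matching the Count and total values printed in the runs for that cache size.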

So, with a cache size of 12000000 (12 MB):

The numbers are:

real    0m0.008s
user    0m0.001s
sys 0m0.007s
Count = 4800, total=11520000, failed=0

real    0m0.007s
user    0m0.000s
sys 0m0.006s
Count = 4800, total=11520000, failed=0

real    0m0.008s
user    0m0.001s
sys 0m0.006s
Count = 4800, total=11520000, failed=0

real    0m0.014s
user    0m0.003s
sys 0m0.010s

A few runs with malloc:

real    0m0.010s
user    0m0.000s
sys 0m0.009s
Count = 4800, total=11520000, failed=0

real    0m0.017s
user    0m0.001s
sys 0m0.015s
Count = 4800, total=11520000, failed=0

real    0m0.012s
user    0m0.001s
sys 0m0.010s
Count = 4800, total=11520000, failed=0

real    0m0.021s
user    0m0.007s
sys 0m0.013s
Count = 4800, total=11520000, failed=0

real    0m0.010s
user    0m0.001s
sys 0m0.008s
Count = 4800, total=11520000, failed=0

real    0m0.009s
user    0m0.001s
sys 0m0.007s

Making the cache size 10 times larger gives the following results for alloc:

real    0m0.038s
user    0m0.001s
sys 0m0.036s
Count = 4800, total=115200000, failed=0

real    0m0.040s
user    0m0.001s
sys 0m0.037s
Count = 4800, total=115200000, failed=0

real    0m0.045s
user    0m0.001s
sys 0m0.043s
Count = 4800, total=115200000, failed=0

real    0m0.044s
user    0m0.001s
sys 0m0.043s
Count = 4800, total=115200000, failed=0

real    0m0.046s
user    0m0.001s
sys 0m0.043s
Count = 4800, total=115200000, failed=0

real    0m0.042s
user    0m0.000s
sys 0m0.042s

With malloc:

real    0m0.026s
user    0m0.004s
sys 0m0.021s
Count = 4800, total=115200000, failed=0

real    0m0.027s
user    0m0.002s
sys 0m0.023s
Count = 4800, total=115200000, failed=0

real    0m0.022s
user    0m0.002s
sys 0m0.018s
Count = 4800, total=115200000, failed=0

real    0m0.016s
user    0m0.001s
sys 0m0.015s
Count = 4800, total=115200000, failed=0

real    0m0.027s
user    0m0.002s
sys 0m0.024s
Count = 4800, total=115200000, failed=0

Another 10x, with alloc:

real    0m1.408s
user    0m0.002s
sys 0m1.395s
Count = 4800, total=1152000000, failed=0

real    0m1.517s
user    0m0.001s
sys 0m1.505s
Count = 4800, total=1152000000, failed=0

real    0m1.478s
user    0m0.000s
sys 0m1.466s
Count = 4800, total=1152000000, failed=0

real    0m1.401s
user    0m0.001s
sys 0m1.389s
Count = 4800, total=1152000000, failed=0

real    0m1.445s
user    0m0.002s
sys 0m1.433s
Count = 4800, total=1152000000, failed=0

real    0m1.468s
user    0m0.000s
sys 0m1.458s
Count = 4800, total=1152000000, failed=0

With malloc:

real    0m0.020s
user    0m0.002s
sys 0m0.017s
Count = 4800, total=1152000000, failed=0

real    0m0.022s
user    0m0.001s
sys 0m0.020s
Count = 4800, total=1152000000, failed=0

real    0m0.027s
user    0m0.005s
sys 0m0.021s
Count = 4800, total=1152000000, failed=0

real    0m0.029s
user    0m0.002s
sys 0m0.026s
Count = 4800, total=1152000000, failed=0

real    0m0.020s
user    0m0.001s
sys 0m0.019s
Count = 4800, total=1152000000, failed=0

If we change the code so that BLOCK_SIZE is a constant 1000, the difference between alloc and malloc becomes much smaller. Here are the alloc results:

 Count = 1080000, total=1080000000, failed=0

real    0m1.183s
user    0m0.028s
sys 0m1.137s
Count = 1080000, total=1080000000, failed=0

real    0m1.179s
user    0m0.017s
sys 0m1.143s
Count = 1080000, total=1080000000, failed=0

real    0m1.196s
user    0m0.026s
sys 0m1.152s
Count = 1080000, total=1080000000, failed=0

real    0m1.197s
user    0m0.023s
sys 0m1.157s
Count = 1080000, total=1080000000, failed=0

real    0m1.188s
user    0m0.021s
sys 0m1.147s

Now malloc:

Count = 1080000, total=1080000000, failed=0

real    0m0.582s
user    0m0.063s
sys 0m0.482s
Count = 1080000, total=1080000000, failed=0

real    0m0.586s
user    0m0.062s
sys 0m0.489s
Count = 1080000, total=1080000000, failed=0

real    0m0.582s
user    0m0.059s
sys 0m0.483s
Count = 1080000, total=1080000000, failed=0

real    0m0.590s
user    0m0.064s
sys 0m0.477s
Count = 1080000, total=1080000000, failed=0

real    0m0.586s
user    0m0.075s
sys 0m0.473s