How to optimize nested indexing in a loop

Time: 2015-11-18 03:27:03

Tags: arrays nested

I have a very simple loop in C:

for (i=0; i < len; ++i) {
    beta[index[i]] += d * value[i];
}

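In this loop, beta and value are double arrays, while index is an integer array. beta itself can be a very long array (possibly millions of elements), but len is usually much shorter, say 5% of the length of beta. All the arrays are, of course, independent of each other, and we can also assume that no two entries in index are equal. What bothers me is that nothing I do seems to help. So far I have tried the restrict keyword, #pragma ivdep, manual unrolling, and prefetching (though I may have applied the last two without the right unroll factor/prefetch lookahead), and I even tried using MKL to first gather the values to be updated, do the update with daxpy, and then scatter the result.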
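For readers unfamiliar with the prefetching attempt mentioned above, it might look roughly like the sketch below. The function name `scatter_update_prefetch` and the lookahead `PREFETCH_AHEAD` are assumptions made for illustration, not values from the question, and `__builtin_prefetch` is a GCC/Clang builtin, so other compilers would need a different intrinsic.

```c
#include <stddef.h>

/* Assumed tuning knob: how many iterations ahead to prefetch. */
#define PREFETCH_AHEAD 16

/* Hypothetical helper: the scatter-update loop with software prefetch. */
void scatter_update_prefetch(double *restrict beta,
                             const double *restrict value,
                             const int *restrict index,
                             size_t len, double d)
{
    size_t i;

    for (i = 0; i < len; ++i) {
        /* Request the beta element needed PREFETCH_AHEAD iterations from
         * now (2nd arg 1 = for writing, 3rd arg 0 = low temporal locality). */
        if (i + PREFETCH_AHEAD < len)
            __builtin_prefetch(&beta[index[i + PREFETCH_AHEAD]], 1, 0);

        beta[index[i]] += d * value[i];
    }
}
```

Whether this helps at all depends on the lookahead distance and the hardware, so the distance would have to be tuned by measurement.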

Any suggestions for making this loop as fast as possible? My platform is Intel Linux.

Thanks, --Laci

Here is the complete code:

#include <stdlib.h>
#include <math.h>

int main(int argc, char *argv[])
{
   int betasize = atoi(argv[1]);
   int len = atoi(argv[2]);
   double *beta = calloc(betasize, sizeof(double));
   double *value = malloc(len*sizeof(double));
   int *index = malloc(len*sizeof(int));
   int i;
   const double d = 2.5;

   /* randomly pick len entries */
   for (i=0; i < len; ++i) {
      while (1) {
         const int ind = floor(drand48()*betasize);
         if (beta[ind] == 0) {
            value[i] = drand48();
            index[i] = ind;
            beta[ind] = 1;
            break;
         }
      }
   }

   /* Now the loop to be optimized */
   for (i=0; i < len; ++i) {
      beta[index[i]] += d * value[i];
   }
}

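Put your favorite timing measurement around the loop to be optimized, then run it as 'a.out 100000000 1000000'. You can also use smaller arrays, but then it is harder to time. Also note that the random index generation slows down dramatically as the second number approaches the first :-) ... but in my use case the second number is usually no more than 1% of the first, and rarely more than 5%.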
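One possible way to time the loop, as a minimal sketch: it assumes a POSIX system where `clock_gettime` with `CLOCK_MONOTONIC` is available, and the helper names `now_sec` and `time_update_loop` are made up for this example.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* ask for clock_gettime under strict -std= modes */
#endif
#include <time.h>

/* Hypothetical helper: current monotonic time in seconds. */
static double now_sec(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Run one pass of the update loop and return the elapsed seconds. */
double time_update_loop(double *beta, const double *value,
                        const int *index, int len, double d)
{
    int i;
    double t0 = now_sec();

    for (i = 0; i < len; ++i)
        beta[index[i]] += d * value[i];

    return now_sec() - t0;
}
```

Calling `time_update_loop(beta, value, index, len, d)` in place of the final loop in `main` and printing the result gives one sample; repeating the call a few times and taking the minimum tends to be less noisy.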

1 answer:

Answer 0: (score: 1)

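OK, I ran benchmarks on your loop. I moved it into a separate function and created 10 similar functions, each with a slight variation. Your original is alg_0a, the reference going forward.

The others are various combinations of removing something (e.g. using a fixed number instead of value[i]). The idea is that when you remove something [that the real algorithm still needs] and the removal improves performance dramatically, the removed item is the performance bottleneck [the "hot spot"]. The algorithms to look at are alg_0*. Look at the loop in each one to see what is left out.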

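In alg_1*, I tried to get better cache performance/locality for the index/value arrays by merging them into a struct. This had almost no effect: index and value are accessed sequentially, so they already have good cache performance [even when separate], and a struct cannot improve on that. So you can skip the alg_1* data.

The worst performer [because it does the most] is the original alg_0a. Better is alg_0e, which fetches from the index and value arrays but stores sequentially into the beta array. Better still is alg_0d, which does what alg_0e does except that it stores into beta at a fixed index.

The only thing that really matters is the access pattern when writing to beta. Because the index array holds random indexes for the beta accesses, the beta array gets poor cache performance.

The actual test data in the index array [random] may be skewing [and/or invalidating] the results. In the real program, if the array truly holds semi-random indexes, then that is the "hot spot".

Is there a better model for generating a more representative, and perhaps more cache-friendly, index array? That is, the real accesses might look more like 1,2,3,4 99,100,101,102 6000,6001,6002 ...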
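If the real indexes do come in runs like that, a test generator could imitate them. The sketch below is one assumed model, not the poster's actual data: `gen_clustered_index` and its `run_len` parameter are hypothetical. It splits beta into one slot per run and places a run of consecutive indexes at a random offset inside each slot, which keeps all entries unique.

```c
#include <stdlib.h>

/* Hypothetical generator: fill index[] with len unique entries grouped into
 * runs of run_len consecutive values. Returns 0 on success, or -1 if the
 * arguments are invalid or the runs do not fit into betasize. */
int gen_clustered_index(int *index, int len, int betasize, int run_len)
{
    int nruns, stride, r, k;
    int n = 0;

    if (len <= 0 || run_len <= 0 || betasize <= 0)
        return -1;

    nruns = (len + run_len - 1) / run_len;  /* number of runs, rounded up */
    stride = betasize / nruns;              /* one slot of beta per run */
    if (stride < run_len)
        return -1;

    for (r = 0; r < nruns; ++r) {
        /* random start inside this run's slot, leaving room for the run */
        int start = r * stride + rand() % (stride - run_len + 1);

        for (k = 0; k < run_len && n < len; ++k)
            index[n++] = start + k;
    }

    return 0;
}
```

The output here is also sorted, which is friendlier to the cache than real data may be; shuffling the order of the runs would give a middle ground between this and fully random indexes.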

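Without knowing how index and value are created, and when, it is hard to speculate. Are index and value always created at the same time?

In a sense, looping through index/value forms a "schedule" of updates to apply to beta. If this schedule could be sorted by beta index, it might give more sequential beta access.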
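A minimal sketch of sorting the schedule, assuming the index/value pairs can be stored together: the `update_t` type and the function names are hypothetical, and the standard `qsort` is used only for brevity.

```c
#include <stdlib.h>

/* Hypothetical pair type: one scheduled update to beta. */
typedef struct {
    int index;      /* position in beta */
    double value;   /* value to scale by d and add */
} update_t;

/* Order updates by ascending beta index. */
static int cmp_update(const void *a, const void *b)
{
    const update_t *ua = a;
    const update_t *ub = b;

    return (ua->index > ub->index) - (ua->index < ub->index);
}

/* Sort the schedule so that beta is later visited in address order. */
void sort_updates(update_t *upd, size_t len)
{
    qsort(upd, len, sizeof(*upd), cmp_update);
}

/* Apply the (ideally sorted) schedule to beta. */
void apply_updates(double *beta, const update_t *upd, size_t len, double d)
{
    size_t i;

    for (i = 0; i < len; ++i)
        beta[upd[i].index] += d * upd[i].value;
}
```

The sort itself costs O(len log len), so this is only likely to pay off if the same schedule is applied many times or if the indexes can be produced in sorted order to begin with.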

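Would a different implementation of the index/value arrays versus beta be better (e.g. a linked list or a sparse matrix)? That is, we want to make the beta accesses as cache friendly as possible.

Normally I would suggest using openmp to parallelize this. However, the tests I ran, with their random access pattern, seem to indicate that the loop is memory/cache bound, not CPU/computation bound, so extra threads may not help much.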
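For reference, the openmp variant would look roughly like the sketch below. The function name `scatter_update_omp` is an assumption. It relies on the question's statement that no two index entries are equal; with duplicate indexes it would have a data race.

```c
/* Build with an OpenMP flag such as -fopenmp (GCC/Clang). Without the flag
 * the pragma is ignored and the loop simply runs serially. */
void scatter_update_omp(double *beta, const double *value,
                        const int *index, int len, double d)
{
    int i;

    /* Safe only because every index[i] is unique, so no two iterations
     * write to the same beta element. */
    #pragma omp parallel for
    for (i = 0; i < len; ++i)
        beta[index[i]] += d * value[i];
}
```

If the loop is in fact limited by memory latency rather than arithmetic, any gain from this would have to be confirmed by measurement.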

Now that I think about it, this looks like a scatter plot/monte carlo algorithm.

It may be beneficial to "invert" the problem. Here is a snippet:

for (betaidx = 0;  betaidx < betasize;  ++betaidx) {
    if (beta_needs_change(betaidx))
        beta[betaidx] = beta_new_value(betaidx);
}

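Note that beta_needs_change needs to be fast [probably inline]. Most of the time it will return false. It seems counterintuitive that this [much larger] loop could be faster than the existing index/value loop, but it may be worth trying.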
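One way the inverted loop could be fleshed out, purely as a sketch: here beta_needs_change is assumed to test a bit vector with one bit per beta element, and the new value comes from a dense `delta` array of the same length as beta. The helper `mark_change`, the `delta` array, and the extra parameters are all hypothetical additions.

```c
#include <limits.h>
#include <stddef.h>

/* Test bit i: nonzero means beta[i] has a pending update. */
static inline int beta_needs_change(const unsigned char *bits, size_t i)
{
    return (bits[i / CHAR_BIT] >> (i % CHAR_BIT)) & 1;
}

/* Hypothetical helper: flag beta[i] as needing an update. */
static inline void mark_change(unsigned char *bits, size_t i)
{
    bits[i / CHAR_BIT] |= (unsigned char)(1u << (i % CHAR_BIT));
}

/* Walk beta sequentially and update only the flagged elements. */
void update_inverted(double *beta, size_t betasize,
                     const unsigned char *bits, const double *delta, double d)
{
    size_t i;

    for (i = 0; i < betasize; ++i) {
        if (beta_needs_change(bits, i))
            beta[i] += d * delta[i];
    }
}
```

The bit vector is small (betasize / 8 bytes) and is read sequentially, but the dense `delta` array costs as much memory as beta itself, so this trades space for a predictable access pattern.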

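My best guess is that a "proper" sparse matrix implementation might help.

Or, depending on how beta is used after the loop, it might be possible to index everything by the same index variable: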

for (idx = 0;  idx < len;  ++idx)
    beta[idx] += d * value[idx];

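Now beta would have to be read/used in a completely different way, and value might have to be constructed differently. But if you can cut 40% of your program's CPU usage by a factor of 6, a rewrite to do this may be worth it. As a positive side benefit, the rest of the program might also gain from the restructuring.

Here is the [heavily] modified source code. It is not buildable because it relies on my special tools and libraries. Ignore most of it and look at the loops inside the alg_0* functions.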

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>

#define _BETALOOP_GLO_
#include <ovrlib/ovrlib.h>
#include <betaloop/bncdef.h>

#define FREEME(_ptr) \
    do { \
        if (_ptr != NULL) \
            free(_ptr); \
        _ptr = NULL; \
    } while (0)

typedef int betaidx_t;
typedef int validx_t;

typedef struct {
    double value;
    betaidx_t index;
    double *beta;
} pair_t;

typedef struct {
    betaidx_t betasize;                 // range value for beta
    validx_t vlen;                      // length of value array

    double *beta;
    double *value;
    betaidx_t *index;

    pair_t *pair;
} ctl_t;

// datagen_0a -- randomly pick len entries
void
datagen_0a(ctl_t *ctl,int betasize,int len)
{
    double *beta;
    double *value;
    betaidx_t *index;
    validx_t i;
    betaidx_t ind;

    memset(ctl,0,sizeof(ctl_t));

    ctl->betasize = betasize;
    ctl->vlen = len;

    beta = calloc(betasize,sizeof(double));
    ctl->beta = beta;

    value = malloc(len * sizeof(double));
    ctl->value = value;

    index = malloc(len * sizeof(int));
    ctl->index = index;

    BNCBEG(datagen_0a);

    for (i = 0; i < len; ++i) {
        while (1) {
            ind = floor(drand48() * betasize);
            ind %= betasize;
            if (beta[ind] == 0) {
                value[i] = drand48();
                index[i] = ind;
                beta[ind] = 1;
                break;
            }
        }
    }

    BNCEND(datagen_0a);
}

// datagen_0b -- randomly pick len entries
void
datagen_0b(ctl_t *ctl,betaidx_t betasize,int len)
{
    double *beta;
    double *value;
    double curval;
    betaidx_t *index;
    byte *btv;
    pair_t *pair;
    validx_t validx;
    betaidx_t betaidx;

    memset(ctl,0,sizeof(ctl_t));

    ctl->betasize = betasize;
    ctl->vlen = len;

    beta = calloc(betasize,sizeof(double));
    ctl->beta = beta;

    value = malloc(len * sizeof(double));
    ctl->value = value;

    index = malloc(len * sizeof(int));
    ctl->index = index;

    pair = malloc(len * sizeof(pair_t));
    ctl->pair = pair;

    btv = calloc(BTVSIZE(betasize),sizeof(byte));

    BNCBEG(datagen_0b);

    for (validx = 0;  validx < len;  ++validx) {
        while (1) {
            betaidx = floor(drand48() * betasize);
            betaidx %= betasize;
            if (! BTVTST(btv,betaidx)) {
                BTVSET(btv,betaidx);

                curval = drand48();
                value[validx] = curval;  // keep value[] and pair[] in agreement
                index[validx] = betaidx;

                if (pair != NULL) {
                    pair[validx].value = curval;
                    pair[validx].index = betaidx;
                    pair[validx].beta = &beta[betaidx];
                }

                beta[betaidx] = 1;
                break;
            }
        }
    }

    BNCEND(datagen_0b);

    free(btv);
}

// datarls_0 -- release allocated memory
void
datarls_0(ctl_t *ctl)
{

    FREEME(ctl->beta);
    FREEME(ctl->value);
    FREEME(ctl->index);
    FREEME(ctl->pair);
}

// fixed_index -- get fixed beta index
betaidx_t
fixed_index(ctl_t *ctl)
{
    betaidx_t index;

    while (1) {
        index = floor(drand48() * ctl->betasize);
        index %= ctl->betasize;
        if ((index | 1) < ctl->betasize)
            break;
    }

    return index;
}

// alg_0a -- Now the loop to be optimized
void
alg_0a(ctl_t *ctl)
{
    double *beta;
    double *value;
    betaidx_t *index;
    validx_t validx;
    validx_t len;
    const double d = 2.5;

    BNCBEG(alg_0a);

    beta = ctl->beta;
    value = ctl->value;
    index = ctl->index;
    len = ctl->vlen;

    for (validx = 0;  validx < len;  ++validx)
        beta[index[validx]] += d * value[validx];

    BNCEND(alg_0a);
}

// alg_0b -- null destination
double
alg_0b(ctl_t *ctl)
{
    double beta;
    double *value;
    validx_t validx;
    validx_t len;
    const double d = 2.5;

    BNCBEG(alg_0b -- betanull);

    beta = 0.0;
    value = ctl->value;
    len = ctl->vlen;

    for (validx = 0;  validx < len;  ++validx)
        beta += d * value[validx];

    BNCEND(alg_0b);

    return beta;
}

// alg_0c -- fixed destination
void
alg_0c(ctl_t *ctl)
{
    double *beta;
    double *value;
    betaidx_t index;
    validx_t validx;
    validx_t len;
    const double d = 2.5;

    index = fixed_index(ctl);

    BNCBEG(alg_0c -- betafixed);

    beta = ctl->beta;
    value = ctl->value;
    len = ctl->vlen;

    for (validx = 0;  validx < len;  ++validx, index ^= 1)
        beta[index] += d * value[validx];

    BNCEND(alg_0c);
}

// alg_0d -- fixed destination with index array fetch
betaidx_t
alg_0d(ctl_t *ctl)
{
    double *beta;
    double *value;
    betaidx_t *idxptr;
    betaidx_t index;
    validx_t validx;
    validx_t len;
    const double d = 2.5;
    betaidx_t totidx;

    index = fixed_index(ctl);

    BNCBEG(alg_0d -- beta_fixed_index);

    beta = ctl->beta;
    value = ctl->value;
    idxptr = ctl->index;
    len = ctl->vlen;
    totidx = 0;

    for (validx = 0;  validx < len;  ++validx, index ^= 1) {
        totidx += idxptr[validx];
        beta[index] += d * value[validx];
    }

    BNCEND(alg_0d);

    return totidx;
}

// alg_0e -- sequential destination with index array fetch
betaidx_t
alg_0e(ctl_t *ctl)
{
    double *beta;
    double *value;
    betaidx_t *idxptr;
    betaidx_t index;
    validx_t validx;
    validx_t len;
    const double d = 2.5;
    betaidx_t totidx;

    BNCBEG(alg_0e -- beta_seq_index);

    index = 0;
    beta = ctl->beta;
    value = ctl->value;
    idxptr = ctl->index;
    len = ctl->vlen;
    totidx = 0;

    for (validx = 0;  validx < len;  ++validx) {
        totidx += idxptr[validx];
        beta[index] += d * value[validx];
        index = (index + 1) % ctl->betasize;
    }

    BNCEND(alg_0e);

    return totidx;
}

// alg_0f -- null source
void
alg_0f(ctl_t *ctl)
{
    double *beta;
    double value;
    betaidx_t *index;
    validx_t validx;
    validx_t len;
    const double d = 2.5;

    value = drand48();

    BNCBEG(alg_0f -- nullsrc);

    beta = ctl->beta;
    index = ctl->index;
    len = ctl->vlen;

    for (validx = 0;  validx < len;  ++validx)
        beta[index[validx]] += d * value;

    BNCEND(alg_0f);
}

// alg_1a -- use pair struct with index
void
alg_1a(ctl_t *ctl)
{
    double *beta;
    validx_t validx;
    validx_t len;
    const pair_t *pair;
    const double d = 2.5;

    BNCBEG(alg_1a -- pair);

    beta = ctl->beta;
    len = ctl->vlen;
    pair = ctl->pair;

    for (validx = 0;  validx < len;  ++validx, ++pair)
        beta[pair->index] += d * pair->value;

    BNCEND(alg_1a);
}

// alg_1b -- use pair struct with epair
void
alg_1b(ctl_t *ctl)
{
    double *beta;
    const pair_t *pair;
    const pair_t *epair;
    const double d = 2.5;

    BNCBEG(alg_1b -- epair);

    beta = ctl->beta;
    pair = ctl->pair;
    epair = pair + ctl->vlen;

    for (;  pair < epair;  ++pair)
        beta[pair->index] += d * pair->value;

    BNCEND(alg_1b);
}

// alg_1c -- use pair struct, epair, and beta pointer
void
alg_1c(ctl_t *ctl)
{
    const pair_t *pair;
    const pair_t *epair;
    const double d = 2.5;

    BNCBEG(alg_1c -- betap);

    pair = ctl->pair;
    epair = pair + ctl->vlen;

    for (;  pair < epair;  ++pair)
        *pair->beta += d * pair->value;

    BNCEND(alg_1c);
}

// alg_1d -- fixed destination with index array fetch
betaidx_t
alg_1d(ctl_t *ctl)
{
    double *beta;
    const pair_t *pair;
    const pair_t *epair;
    const double d = 2.5;
    betaidx_t index;
    betaidx_t totidx;

    index = fixed_index(ctl);

    BNCBEG(alg_1d -- beta_fixed_index);

    beta = ctl->beta;
    pair = ctl->pair;
    epair = pair + ctl->vlen;
    totidx = 0;

    for (;  pair < epair;  ++pair, index ^= 1) {
        totidx += pair->index;
        beta[index] += d * pair->value;
    }

    BNCEND(alg_1d);

    return totidx;
}

// alg_1e -- sequential destination with index array fetch
betaidx_t
alg_1e(ctl_t *ctl)
{
    double *beta;
    const pair_t *pair;
    const pair_t *epair;
    const double d = 2.5;
    betaidx_t index;
    betaidx_t totidx;

    BNCBEG(alg_1e -- beta_seq_index);

    beta = ctl->beta;
    pair = ctl->pair;
    epair = pair + ctl->vlen;
    totidx = 0;
    index = 0;

    for (;  pair < epair;  ++pair) {
        totidx += pair->index;
        beta[index] += d * pair->value;
        index = (index + 1) % ctl->betasize;
    }

    BNCEND(alg_1e);

    return totidx;
}

// dotest -- do test
void
dotest(int betasize,int len)
{
    ctl_t ctl;
    int tryidx;

    printf("\n");
    printf("dotest: %d %d\n",betasize,len);

    BNCBEG(dotest);

#if 0
    datagen_0a(&ctl,betasize,len);
#endif
#if 1
    datagen_0b(&ctl,betasize,len);
#endif
    for (tryidx = 1;  tryidx <= 3;  ++tryidx) {
        alg_0a(&ctl);
        alg_0b(&ctl);
        alg_0c(&ctl);
        alg_0d(&ctl);
        alg_0e(&ctl);
        alg_0f(&ctl);

        alg_1a(&ctl);
        alg_1b(&ctl);
        alg_1c(&ctl);
        alg_1d(&ctl);
        alg_1e(&ctl);
    }
    datarls_0(&ctl);

    BNCEND(dotest);

    bncdmpa("dotest",1);
}

// main -- main program
int
main(int argc,char **argv)
{

    --argc;
    ++argv;

    bncatt(betaloop_bnc);

    dotest(100000000,1000000);
    dotest(500000000,5000000);
    dotest(1000000000,10000000);

    return 0;
}

Here is the benchmark output. The min time (in fractional seconds) is the number to compare.

17:39:35.550606012 NEWDAY 11/18/15
17:39:35.550606012 ph: starting 13162 ...
17:39:35.551221132 ph: ARGV ovrgo ...

17:39:36 ovrgo: SDIR /home/cae/preserve/ovrbnc/betaloop
17:39:37 ovrgo: /home/cae/preserve/ovrstk/gen/betaloop/betaloop
bnctst: BEST bncmin=21 bncmax=-1 skipcnt=1000
bnctst: AVG tot=0.000000000

bnctst: SKP tot=0.000023607 avg=0.000000023 cnt=1,000
bnctst: BNCDMP   min=0.000000000 max=0.000000000

dotest: 100000000 1000000
dotest: BNCDMP alg_0a tot=0.087135098 avg=0.029045032 cnt=3
dotest: BNCDMP   min=0.028753797 max=0.029378731
dotest: BNCDMP alg_0b -- betanull tot=0.003669541 avg=0.001223180 cnt=3
dotest: BNCDMP   min=0.001210105 max=0.001242112
dotest: BNCDMP alg_0c -- betafixed tot=0.005472318 avg=0.001824106 cnt=3
dotest: BNCDMP   min=0.001815115 max=0.001830939
dotest: BNCDMP alg_0d -- beta_fixed_index tot=0.005654055 avg=0.001884685 cnt=3
dotest: BNCDMP   min=0.001883760 max=0.001885919
dotest: BNCDMP alg_0e -- beta_seq_index tot=0.025247095 avg=0.008415698 cnt=3
dotest: BNCDMP   min=0.008410631 max=0.008423921
dotest: BNCDMP alg_0f -- nullsrc tot=0.085769224 avg=0.028589741 cnt=3
dotest: BNCDMP   min=0.028477846 max=0.028683057
dotest: BNCDMP alg_1a -- pair tot=0.090740003 avg=0.030246667 cnt=3
dotest: BNCDMP   min=0.030003776 max=0.030385588
dotest: BNCDMP alg_1b -- epair tot=0.093591309 avg=0.031197103 cnt=3
dotest: BNCDMP   min=0.030324733 max=0.032524565
dotest: BNCDMP alg_1c -- betap tot=0.091931228 avg=0.030643742 cnt=3
dotest: BNCDMP   min=0.030357306 max=0.031191412
dotest: BNCDMP alg_1d -- beta_fixed_index tot=0.007939126 avg=0.002646375 cnt=3
dotest: BNCDMP   min=0.002508210 max=0.002853244
dotest: BNCDMP alg_1e -- beta_seq_index tot=0.025939159 avg=0.008646386 cnt=3
dotest: BNCDMP   min=0.008606238 max=0.008683529
dotest: BNCDMP datagen_0b tot=0.136931619
dotest: BNCDMP dotest tot=0.956365745

dotest: 500000000 5000000
dotest: BNCDMP alg_0a tot=0.737332506 avg=0.245777502 cnt=3
dotest: BNCDMP   min=0.244778177 max=0.247548555
dotest: BNCDMP alg_0b -- betanull tot=0.018095312 avg=0.006031770 cnt=3
dotest: BNCDMP   min=0.005912708 max=0.006225743
dotest: BNCDMP alg_0c -- betafixed tot=0.028059365 avg=0.009353121 cnt=3
dotest: BNCDMP   min=0.009303443 max=0.009407530
dotest: BNCDMP alg_0d -- beta_fixed_index tot=0.029024875 avg=0.009674958 cnt=3
dotest: BNCDMP   min=0.009550901 max=0.009752188
dotest: BNCDMP alg_0e -- beta_seq_index tot=0.127149609 avg=0.042383203 cnt=3
dotest: BNCDMP   min=0.042218860 max=0.042529218
dotest: BNCDMP alg_0f -- nullsrc tot=0.724878907 avg=0.241626302 cnt=3
dotest: BNCDMP   min=0.240794352 max=0.242174302
dotest: BNCDMP alg_1a -- pair tot=0.764044535 avg=0.254681511 cnt=3
dotest: BNCDMP   min=0.253329522 max=0.256864373
dotest: BNCDMP alg_1b -- epair tot=0.769463084 avg=0.256487694 cnt=3
dotest: BNCDMP   min=0.254830714 max=0.258763409
dotest: BNCDMP alg_1c -- betap tot=0.765345462 avg=0.255115154 cnt=3
dotest: BNCDMP   min=0.254364352 max=0.256134647
dotest: BNCDMP alg_1d -- beta_fixed_index tot=0.039104441 avg=0.013034813 cnt=3
dotest: BNCDMP   min=0.012103513 max=0.014354033
dotest: BNCDMP alg_1e -- beta_seq_index tot=0.130221038 avg=0.043407012 cnt=3
dotest: BNCDMP   min=0.043143231 max=0.043752516
dotest: BNCDMP datagen_0b tot=2.060880641
dotest: BNCDMP dotest tot=6.611719277

dotest: 1000000000 10000000
dotest: BNCDMP alg_0a tot=1.726930574 avg=0.575643524 cnt=3
dotest: BNCDMP   min=0.575218786 max=0.576291884
dotest: BNCDMP alg_0b -- betanull tot=0.035615393 avg=0.011871797 cnt=3
dotest: BNCDMP   min=0.011820026 max=0.011948646
dotest: BNCDMP alg_0c -- betafixed tot=0.056452922 avg=0.018817640 cnt=3
dotest: BNCDMP   min=0.018590739 max=0.019195537
dotest: BNCDMP alg_0d -- beta_fixed_index tot=0.057788343 avg=0.019262781 cnt=3
dotest: BNCDMP   min=0.019061426 max=0.019560949
dotest: BNCDMP alg_0e -- beta_seq_index tot=0.253575597 avg=0.084525199 cnt=3
dotest: BNCDMP   min=0.084169403 max=0.084902168
dotest: BNCDMP alg_0f -- nullsrc tot=1.718326633 avg=0.572775544 cnt=3
dotest: BNCDMP   min=0.571082648 max=0.575134694
dotest: BNCDMP alg_1a -- pair tot=1.792905583 avg=0.597635194 cnt=3
dotest: BNCDMP   min=0.590378177 max=0.603253947
dotest: BNCDMP alg_1b -- epair tot=1.797667694 avg=0.599222564 cnt=3
dotest: BNCDMP   min=0.589916620 max=0.609757778
dotest: BNCDMP alg_1c -- betap tot=1.794606586 avg=0.598202195 cnt=3
dotest: BNCDMP   min=0.593164605 max=0.604739478
dotest: BNCDMP alg_1d -- beta_fixed_index tot=0.073755595 avg=0.024585198 cnt=3
dotest: BNCDMP   min=0.024126694 max=0.025124542
dotest: BNCDMP alg_1e -- beta_seq_index tot=0.261664945 avg=0.087221648 cnt=3
dotest: BNCDMP   min=0.086277263 max=0.087966703
dotest: BNCDMP datagen_0b tot=4.160519571
dotest: BNCDMP dotest tot=14.607990774

17:39:59.970197677 ph: complete (ELAPSED: 00:00:24.418215274)