"Bad" GCC optimization performance

Question

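I am trying to understand why compiling with -O2 -march=native under GCC gives slower code than compiling without those flags. Note that I am using MinGW (GCC 4.7.1) under Windows 7.

Here is my code:

struct.hpp: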

#ifndef STRUCT_HPP
#define STRUCT_HPP

#include <iostream>

class Figure
{
public:
    Figure(char *pName);
    virtual ~Figure();

    char *GetName();
    double GetArea_mm2(int factor);

private:
    char name[64];
    virtual double GetAreaEx_mm2() = 0;
};

class Disk : public Figure
{
public:
    Disk(char *pName, double radius_mm);
    ~Disk();

private:
    double radius_mm;
    virtual double GetAreaEx_mm2();
};

class Square : public Figure
{
public:
    Square(char *pName, double side_mm);
    ~Square();  

private:
    double side_mm;
    virtual double GetAreaEx_mm2();
};

#endif

struct.cpp:

#include <cstdio>
#include "struct.hpp"

Figure::Figure(char *pName)
{
    sprintf(name, pName);
}

Figure::~Figure()
{
}

char *Figure::GetName()
{
    return name;
}

double Figure::GetArea_mm2(int factor)
{
    return (double)factor*GetAreaEx_mm2();
}

Disk::Disk(char *pName, double radius_mm_) :
Figure(pName), radius_mm(radius_mm_)
{
}

Disk::~Disk()
{
}

double Disk::GetAreaEx_mm2()
{
    return 3.1415926*radius_mm*radius_mm;
}

Square::Square(char *pName, double side_mm_) :
Figure(pName), side_mm(side_mm_)
{
}

Square::~Square()
{
}

double Square::GetAreaEx_mm2()
{
    return side_mm*side_mm;
}

main.cpp:

#include <iostream>
#include <cstdio>
#include "struct.hpp"

double Do(int n)
{
    double sum_mm2 = 0.0;
    const int figuresCount = 10000;
    Figure **pFigures = new Figure*[figuresCount];

    for (int i = 0; i < figuresCount; ++i)
    {
        if (i % 2)
            pFigures[i] = new Disk((char *)"-Disque", i);
        else
            pFigures[i] = new Square((char *)"-Carré", i);
    }

    for (int a = 0; a < n; ++a)
    {
        for (int i = 0; i < figuresCount; ++i)
        {
            sum_mm2 += pFigures[i]->GetArea_mm2(i);
            sum_mm2 += (double)(pFigures[i]->GetName()[0] - '-');
        }
    }

    for (int i = 0; i < figuresCount; ++i)
        delete pFigures[i];

    delete[] pFigures;

    return sum_mm2;
}

int main()
{
    double a = 0;

    StartChrono();      // home made lib, working fine
    a = Do(10000);
    double elapsedTime_ms = StopChrono();

    std::cout << "Elapsed time : " << elapsedTime_ms << " ms" << std::endl;

    return (int)a % 2;  // To force the optimizer to keep the Do() call
}
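
The StartChrono() and StopChrono() helpers come from a home-made library that is not shown. For anyone trying to reproduce the timings, a minimal stand-in might look like the sketch below. It assumes a single global start point (the name g_chronoStart is hypothetical) and uses std::chrono::steady_clock; the original library may well work differently.

```cpp
#include <chrono>

// Hypothetical stand-ins for the question's StartChrono()/StopChrono().
// The original home-made library is not shown, so this is only an assumption
// about its behaviour: start a timer, then report elapsed milliseconds.
namespace {
std::chrono::steady_clock::time_point g_chronoStart;
}

// Record the current time as the start of the measured interval.
void StartChrono()
{
    g_chronoStart = std::chrono::steady_clock::now();
}

// Return the milliseconds elapsed since the last StartChrono() call.
double StopChrono()
{
    const auto elapsed = std::chrono::steady_clock::now() - g_chronoStart;
    return std::chrono::duration<double, std::milli>(elapsed).count();
}
```

steady_clock is used here rather than system_clock because it is monotonic, so clock adjustments during the run cannot distort the measurement.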

I compiled the code twice:

1: Without optimization

mingw32-g++.exe -Wall -fexceptions -std=c++11 -c main.cpp -o main.o

mingw32-g++.exe -Wall -fexceptions -std=c++11 -c struct.cpp -o struct.o

mingw32-g++.exe -o program.exe main.o struct.o -s

2: With -O2 optimization

mingw32-g++.exe -Wall -fexceptions -O2 -march=native -std=c++11 -c main.cpp -o main.o

mingw32-g++.exe -Wall -fexceptions -O2 -march=native -std=c++11 -c struct.cpp -o struct.o

mingw32-g++.exe -o program.exe main.o struct.o -s

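1: Execution time:

1196 ms (1269 ms with Visual Studio 2013)

2: Execution time:

1569 ms (403 ms with Visual Studio 2013)!!!

Using -O3 instead of -O2 does not improve the results. I was, and still am, fairly convinced that GCC and Visual Studio are equivalent, so I don't understand this huge difference. I also don't understand why the optimized version is slower than the non-optimized version with GCC.

Am I missing something here? (Note that I had the same problem with genuine GCC 4.8.2 on Ubuntu.)

Thanks for your help.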

Answer 1

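Since I have not seen the assembly code, I am going to speculate the following:

The allocation loop can be optimized (by the compiler) by removing the if clause, resulting in the following: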

 for (int i=0;i <10000 ; i+=2)
 {
       pFigures[i] = new Square(...);
 }
 for (int i=1;i <10000 ; i +=2)
 {
       pFigures[i] = new Disk(...);
 }

Considering that the end condition is a multiple of 4, it can be made even more "efficient":

 for (int i=0;i < 10000 ;i+=2*4)
 {
     pFigures[i] = ...
     pFigures[i+2] = ...
     pFigures[i+4] = ...
     pFigures[i+6] = ...
 }

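Memory-wise, this will allocate the Disks 4 at a time and the Squares 4 at a time.

This means that objects of the same type will end up next to each other in memory.

Next, you iterate over the array 10000 times in normal order (by normal I mean index after index).

Consider where these shapes are allocated in memory. You will end up with 4 times more cache misses (think of the borderline case where 4 disks and 4 squares sit in different pages: you will switch between pages 8 times, whereas in the normal case you would switch between pages only once).

This sort of optimization (if it is done by the compiler, and for your particular code) optimizes the allocation time but not the access time, which in your example is the biggest load.

Test this by removing the i % 2 and see what results you get.

Again, this is pure speculation, and it assumes that the reason for the lower performance is a loop optimization.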
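
One way to probe this speculation directly, as a variation on removing i % 2, is to build the array in both allocation orders and time the same traversal over each. The sketch below is only an illustration: Shape, MakeShape, MakeInterleaved, MakeSplit and SumAreas are hypothetical names, and the classes are trimmed-down stand-ins for the question's Figure hierarchy.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Simplified stand-ins for the question's Figure hierarchy (assumed, trimmed).
struct Shape {
    virtual ~Shape() {}
    virtual double Area() const = 0;
};
struct Disk : Shape {
    explicit Disk(double r) : radius(r) {}
    double Area() const override { return 3.1415926 * radius * radius; }
    double radius;
};
struct Square : Shape {
    explicit Square(double s) : side(s) {}
    double Area() const override { return side * side; }
    double side;
};

// Build the shape that belongs at index i: odd -> Disk, even -> Square.
std::unique_ptr<Shape> MakeShape(int i)
{
    if (i % 2)
        return std::unique_ptr<Shape>(new Disk(i));
    return std::unique_ptr<Shape>(new Square(i));
}

// Allocate in index order, the way the question's loop is written.
std::vector<std::unique_ptr<Shape>> MakeInterleaved(int count)
{
    std::vector<std::unique_ptr<Shape>> shapes(count);
    for (int i = 0; i < count; ++i)
        shapes[i] = MakeShape(i);
    return shapes;
}

// Allocate all squares first, then all disks: the split-loop order
// that the answer speculates the optimizer might produce.
std::vector<std::unique_ptr<Shape>> MakeSplit(int count)
{
    std::vector<std::unique_ptr<Shape>> shapes(count);
    for (int i = 0; i < count; i += 2)
        shapes[i] = MakeShape(i);
    for (int i = 1; i < count; i += 2)
        shapes[i] = MakeShape(i);
    return shapes;
}

// Walk the array index after index, like the hot loop in Do().
double SumAreas(const std::vector<std::unique_ptr<Shape>>& shapes, int passes)
{
    double sum = 0.0;
    for (int p = 0; p < passes; ++p)
        for (std::size_t i = 0; i < shapes.size(); ++i)
            sum += shapes[i]->Area();
    return sum;
}
```

Timing SumAreas(MakeInterleaved(10000), 10000) against SumAreas(MakeSplit(10000), 10000) would show whether allocation order matters on a given machine. Both calls return the same sum, so any difference is purely in access time, and whether one appears at all depends on the allocator and the cache.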

Answer 2

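I suspect that you have an issue unique to the combination of mingw/gcc/glibc on Windows, because your code performs faster with optimizations on Linux, where gcc is altogether more "at home".

On a rather pedestrian Linux VM using gcc 4.8.2: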

$ g++ main.cpp struct.cpp
$ time a.out

real    0m2.981s
user    0m2.876s
sys     0m0.079s

$ g++ -O2 main.cpp struct.cpp
$ time a.out

real    0m1.629s
user    0m1.523s
sys     0m0.041s

...and if you really take the blinkers off the optimizer by deleting struct.cpp and moving the implementation all inline:

$ time a.out

real    0m0.550s
user    0m0.543s
sys     0m0.000s
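
The answer does not show the inlined version. One plausible shape for it, as a sketch only, is to define every member function inside the class bodies in struct.hpp so that the compiler sees the implementations while compiling main.cpp. Two departures from the question's code are assumptions made here: the name is copied with strncpy instead of sprintf, and the name parameters are const.

```cpp
#include <cstring>

// Header-only variant of the question's classes (assumed layout):
// every member function is defined in-class and is therefore implicitly inline.
class Figure
{
public:
    explicit Figure(const char *pName)
    {
        // Bounded copy; the question used sprintf(name, pName) instead.
        std::strncpy(name, pName, sizeof name - 1);
        name[sizeof name - 1] = '\0';
    }
    virtual ~Figure() {}

    const char *GetName() const { return name; }
    double GetArea_mm2(int factor) { return factor * GetAreaEx_mm2(); }

private:
    char name[64];
    virtual double GetAreaEx_mm2() = 0;
};

class Disk : public Figure
{
public:
    Disk(const char *pName, double radius_mm_)
        : Figure(pName), radius_mm(radius_mm_) {}

private:
    double radius_mm;
    double GetAreaEx_mm2() override { return 3.1415926 * radius_mm * radius_mm; }
};

class Square : public Figure
{
public:
    Square(const char *pName, double side_mm_)
        : Figure(pName), side_mm(side_mm_) {}

private:
    double side_mm;
    double GetAreaEx_mm2() override { return side_mm * side_mm; }
};
```

With the definitions visible in the header, calls such as GetName() and GetArea_mm2() can be inlined into the loop in Do(). With the original split into struct.cpp they are opaque calls across translation units, which is presumably the restriction the answer refers to.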
