Matrix multiplication performance: numpy vs. Eigen C++

Time: 2017-12-19 01:53:00

Tags: c++ eigen

I am trying to compare matrix multiplication performance between Eigen in C++ and numpy.

Here is the C++ code for the matrix multiplication:

#include<iostream>
#include <Eigen/Dense>
#include <ctime>
#include <iomanip> 

using namespace Eigen;
using namespace std;


int main()
{
    time_t begin,end;
    double difference=0;
    time (&begin);
    for(int i=0;i<500;++i)
    {
    MatrixXd m1 = MatrixXd::Random(500,500);
    MatrixXd m2 = MatrixXd::Random(500,500);
    MatrixXd m3 = MatrixXd::Zero(500,500);
    m3=m1*m2;
    }
    time (&end); 
    difference = difftime (end,begin);
    std::cout<<"time = "<<std::setprecision(10)<<(difference/500.)<<" seconds"<<std::endl;

    return 0;
}

Compiled with g++ -Wall -Wextra -I "path-to-eigen-directory" prog5.cpp -o prog5 -O3 -std=gnu++0x

Output:

time = 0.116 seconds

Here is the Python code:

import timeit
import numpy as np

start_time = timeit.default_timer()
for i in range(500):

    m1=np.random.rand(500,500)
    m2=np.random.rand(500,500)
    m3=np.zeros((500,500))
    m3=np.dot(m1,m2)

stop_time = timeit.default_timer()
print('Time = {} seconds'.format((stop_time-start_time)/500))

Output:

Time = 0.01877937281645333 seconds

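Compared to Python, the C++ code appears to be about 6 times slower. Can someone offer some insight into whether I am missing something here?

I am using Eigen 3.3.4, the g++ compiler (MinGW.org GCC-6.3.0-1) 6.3.0, Python 3.6.1, and numpy 1.11.3. Python is run from the Spyder IDE, on Windows.

Update

Based on the answer and comments, I have updated the code.

The C++ code was compiled with g++ -Wall -Wextra -I "path-to-eigen-directory" prog5.cpp -o prog5 -O3 -std=gnu++0x -march=native. I could not get -fopenmp to work: there seems to be no output if I use this flag.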

#include<iostream>
#include <Eigen/Dense>
#include <ctime>
#include <iomanip> 

using namespace Eigen;
using namespace std;

int main()
{
    time_t begin,end;
    double difference=0;
    time (&begin);
    for(int i=0;i<10000;++i)
    {
    MatrixXd m1 = MatrixXd::Random(500,500);
    MatrixXd m2 = MatrixXd::Random(500,500);
    MatrixXd m3 = MatrixXd::Zero(500,500);
    m3=m1*m2;
    }
    time (&end); // note time after execution
    difference = difftime (end,begin);
    std::cout<<"Total time = "<<difference<<" seconds"<<std::endl;
    std::cout<<"Average time = "<<std::setprecision(10)<<(difference/10000.)<<" seconds"<<std::endl;

    return 0;
}

Output:

Total time = 328 seconds
Average time = 0.0328 seconds

Python code:

import timeit
import numpy as np

start_time = timeit.default_timer()
for i in range(10000):

    m1=np.random.rand(500,500)
    m2=np.random.rand(500,500)
    m3=np.zeros((500,500))
    m3=np.dot(m1,m2)

stop_time = timeit.default_timer()
print('Total time = {} seconds'.format(stop_time-start_time))
print('Average time = {} seconds'.format((stop_time-start_time)/10000))

Run with the runfile('filename.py') command in the Spyder IDE.

Output:

Total time = 169.35587796526667 seconds
Average time = 0.016935587796526666 seconds

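The performance with Eigen is better now, but it is still not equal to or faster than numpy. Perhaps -fopenmp would do the trick, but I am not sure. However, I am not using any parallelization in numpy, unless it is doing so implicitly.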

1 answer:

Answer 0 (score: 1)

Your benchmark has several problems:

  1. You are benchmarking the system random number generator (called by `MatrixXd::Random`), which is very expensive!
  2. You are missing the compiler flag `-march=native` to get the AVX / FMA boost.
  3. You are missing `-fopenmp` to enable multithreading.

On my quad-core i7 2.6GHz CPU, I get:

    initial code: 0.024s
    after replacing `Random` by `Ones`: 0.018s
    adding `-march=native`: 0.006s
    adding `-fopenmp`: 0.003s

The matrices are a bit too small to get a good multithreading benefit.
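
The point about random number generation applies to any benchmark of this shape: inputs created inside the timed loop are counted as part of the multiplication. As a rough sketch only (not the code used for the timings above), the timed region can be limited to the product itself. The helper names `time_kernel`, `random_matrices` and `matmul` are hypothetical, and the pure-Python stand-ins exist only to keep the example self-contained; with numpy they would presumably be replaced by `np.random.rand` and `np.dot` from the question.

```python
import random
import time


def time_kernel(setup, kernel, repeats=5):
    """Return the average seconds per call of kernel(*args), excluding setup()."""
    total = 0.0
    for _ in range(repeats):
        args = setup()                    # not timed: input generation
        start = time.perf_counter()
        kernel(*args)                     # timed: only the operation under test
        total += time.perf_counter() - start
    return total / repeats


def random_matrices(n=30):
    """Hypothetical stand-in for np.random.rand(n, n): two n-by-n random matrices."""
    def make():
        return [[random.random() for _ in range(n)] for _ in range(n)]
    return make(), make()


def matmul(a, b):
    """Hypothetical stand-in for np.dot: naive pure-Python matrix product."""
    cols = list(zip(*b))                  # columns of b, computed once
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


if __name__ == "__main__":
    avg = time_kernel(random_matrices, matmul)
    print("Average multiply time = {:.6f} seconds".format(avg))
```

The same structure should carry over to the C++ version: create `m1` and `m2` once (or outside the timed region) and time only `m3 = m1 * m2`.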