Performance degradation when embedding Python

Date: 2019-06-24 12:01:16

Tags: python c++ numpy pybind11 python-embedding

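Question: I am trying to call a Python/NumPy function from C++ (embedded Python). However, when called from C++, the function's latency is about 3x higher than what I see from IPython.

Is this expected, or am I doing something wrong?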


Python

Below is the NumPy code I want to embed:

# calc.py
import numpy as np
def preprocess(data, S, M):
    # S and M have shape (1, 3, 1, 1) and broadcast across data
    return (data-S)*M
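
As a quick sanity check of the broadcasting involved, here is a minimal sketch with a much smaller spatial size. The 4x4 size is an assumption chosen for illustration, and the function body is repeated (under the name `preprocess` that the benchmarks import) so the snippet runs on its own:

```python
import numpy as np

def preprocess(data, S, M):
    # Same expression as calc.py: S and M broadcast over the last two axes.
    return (data - S) * M

# Small illustrative shapes (4x4 instead of 1024x1024).
data = np.ones((1, 3, 4, 4), dtype=np.float32) * 100
S = np.ones((1, 3, 1, 1), dtype=np.float32) * 50
M = np.ones((1, 3, 1, 1), dtype=np.float32) * 0.5

out = preprocess(data, S, M)
print(out.shape, out.dtype, out[0, 0, 0, 0])
```

The result keeps the shape and dtype of `data`, and every element equals (100 - 50) * 0.5.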

Using IPython's %timeit, I measure about 3.5 ms:

import numpy as np
from calc import preprocess
H = 1024
A = np.ones((1, 3, H, H), dtype=np.float32) * 100
B = np.ones((1, 3, 1, 1), dtype=np.float32) * 50
C = np.ones((1, 3, 1, 1), dtype=np.float32) * 0.5
Z = preprocess(A, B, C)

In [8]: %timeit Z = preprocess(A, B, C)
100 loops, best of 3: 3.46 ms per loop
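
Note that %timeit reports the best of its repeats, while a harness that divides total time by the iteration count reports a mean, so the two figures are not measured the same way. Whether that accounts for any of the gap is not established here. The sketch below uses the standard `timeit` module to print both statistics for the same call; the iteration count `n_calls` and the inline stand-in for `calc.preprocess` are assumptions made to keep the snippet self-contained:

```python
import timeit
import numpy as np

def preprocess(data, S, M):
    # Stand-in for calc.preprocess so the snippet is self-contained.
    return (data - S) * M

H = 1024
A = np.ones((1, 3, H, H), dtype=np.float32) * 100
B = np.ones((1, 3, 1, 1), dtype=np.float32) * 50
C = np.ones((1, 3, 1, 1), dtype=np.float32) * 0.5

# Time each call separately (number=1), repeated n_calls times.
n_calls = 100  # assumed count, chosen to keep the run short
per_call = timeit.repeat(lambda: preprocess(A, B, C), number=1, repeat=n_calls)

best_ms = min(per_call) * 1e3
mean_ms = sum(per_call) / len(per_call) * 1e3
print("best: %.2f ms, mean: %.2f ms" % (best_ms, mean_ms))
```

Comparing the mean printed here against the mean printed by the C++ program gives a like-for-like number.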

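C++

Here is the complete C++ code I use to benchmark the same function: profile.cpp. It uses the pybind11 library for embedding. pybind11 is header-only, so you can clone the repository and include its headers to make this work.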

#include <pybind11/embed.h>
#include <pybind11/numpy.h>
#include <iostream>
#include <numpy/arrayobject.h>
#include <chrono>
#include <numeric>  // std::accumulate
#include <vector>
namespace py = pybind11;


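// Convert C++ Vector to py::array_t
// Note: with no base handle passed, pybind11 copies the data into a NumPy-owned buffer.
template<typename T>
py::array_t<T> cppvector2pyarray(std::vector<T>& vec, const std::vector<int>& shape) {
    // Row-major (C-order) strides in bytes, built from the innermost axis outwards
    std::vector<ssize_t> stride(shape.size(), sizeof(T));
    for(int i=shape.size()-2; i >= 0; --i) {
        stride[i] = stride[i+1] * shape[i+1];
    }

    return py::array_t<T>(shape, stride, vec.data());
}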

// Convert py::array_t to C++ vector
template<typename T, typename InputType>
std::vector<T> pyarray2cppvector(InputType& pyarr) {
    PyObject* obj = pyarr.ptr();
    PyArrayObject* numpyobj = (PyArrayObject*) obj;
    int nd = numpyobj->nd;
    // Element count is the product of the dimensions, not their sum
    size_t size = 1;
    for(int i=0; i<nd; i++) { size *= numpyobj->dimensions[i]; }
    std::vector<T> cpparr((T*)numpyobj->data, (T*)numpyobj->data+size);
    return cpparr;
}

// A simple timer
template<typename D>
class Timer {
    public:
        using duration = std::chrono::duration<double, D>;
        std::chrono::time_point<std::chrono::high_resolution_clock> _start, _end;
        std::vector<duration> _diffs{0};
        void start() {_start = std::chrono::high_resolution_clock::now(); }
        void stop() {_end = std::chrono::high_resolution_clock::now(); _diffs.push_back(_end - _start);  }
        ssize_t niters() { return _diffs.size(); }
        double total() { return std::accumulate(_diffs.begin(), _diffs.end(), duration()).count(); }
};


Timer<std::milli> timer;
#define H 1024

int main() {
    py::scoped_interpreter guard{};

    // Load the Python module (calc.py defines only preprocess)
    py::module calc = py::module::import("calc");
    py::object preprocess = calc.attr("preprocess");

    // Prepare the inputs
    std::vector<float> A(1*3*H*H, 100.0);
    std::vector<float> B(1*3*1*1, 50.0);
    std::vector<float> C(1*3*1*1, 0.5);
    auto pyA = cppvector2pyarray<float>(A, {1, 3, H, H});
    auto pyB = cppvector2pyarray<float>(B, {1, 3, 1, 1});
    auto pyC = cppvector2pyarray<float>(C, {1, 3, 1, 1});

    // warm up run
    py::object res4 = preprocess(pyA, pyB, pyC);

    // Benchmark
    for(int i=0; i<1e3; ++i) {
        timer.start();
        py::object res4 = preprocess(pyA, pyB, pyC);
        timer.stop();
    }
    std::cout << timer.total() / timer.niters() << std::endl;
}

Compile as follows:

g++ -O3 -Wall -std=c++14 -march=native -I pybind11/include -I /usr/lib/python2.7/dist-packages/numpy/core/include profile.cpp -o profile `python-config --libs --includes` -lpthread

When I benchmark the C++ code, the call alone (excluding the data conversion functions) takes about 9.5 ms, roughly 3x slower than the IPython benchmark.


Environment

Python: v2.7, OS: Ubuntu 16.04
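
When comparing the two setups, it may help to confirm that the embedded interpreter loads the same Python and NumPy as IPython and that the input arrays have the same layout in both. The following is only a diagnostic sketch: `describe_environment` and `describe_array` are hypothetical helper names (not part of the original code) that could be added to calc.py and called from both IPython and the C++ host:

```python
import sys
import numpy as np

def describe_environment():
    # Collect the facts that identify which interpreter and NumPy build are in use.
    return {
        "python": sys.version.split()[0],
        "executable": sys.executable,
        "numpy_version": np.__version__,
        "numpy_path": np.__file__,
    }

def describe_array(arr):
    # Layout properties that can affect how fast NumPy processes an array.
    return {
        "dtype": str(arr.dtype),
        "shape": arr.shape,
        "strides": arr.strides,
        "c_contiguous": arr.flags["C_CONTIGUOUS"],
        "aligned": arr.flags["ALIGNED"],
    }

if __name__ == "__main__":
    print(describe_environment())
    print(describe_array(np.ones((1, 3, 4, 4), dtype=np.float32)))
```

If the two environments report different NumPy paths or versions, or the arrays differ in dtype or contiguity, the timings are not comparing like with like.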


Update, June 25, 2019

I tried the same benchmark with Python 3.6, and the results are still the same.

0 answers:

No answers