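Question: I am trying to call a Python/NumPy function from C++ (embedded Python). However, compared with the performance I see in IPython, the function's latency is about 3x higher when it is called from C++.
Is this expected, or am I doing something wrong?
Here is the NumPy code I want to embed: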
# calc.py
import numpy as np
def preprocess(data, S, M):
    return (data-S)*M
Using IPython's %timeit, I get about 3.5 ms:
import numpy as np
from calc import preprocess
H = 1024
A = np.ones((1, 3, H, H), dtype=np.float32) * 100
B = np.ones((1, 3, 1, 1), dtype=np.float32) * 50
C = np.ones((1, 3, 1, 1), dtype=np.float32) * 0.5
Z = preprocess(A, B, C)
In [8]: %timeit Z = preprocess(A, B, C)
100 loops, best of 3: 3.46 ms per loop
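To reproduce the IPython figure outside an interactive session, a minimal sketch using the standard-library timeit module could look like the following. It inlines preprocess so the script is self-contained, and the repeat and number values are arbitrary choices, not the ones %timeit picked in the measurement above.

```python
import timeit

import numpy as np


def preprocess(data, S, M):
    # Same arithmetic as calc.preprocess, inlined to keep the sketch self-contained.
    return (data - S) * M


H = 1024
A = np.ones((1, 3, H, H), dtype=np.float32) * 100
B = np.ones((1, 3, 1, 1), dtype=np.float32) * 50
C = np.ones((1, 3, 1, 1), dtype=np.float32) * 0.5

Z = preprocess(A, B, C)

# repeat=3 and number=20 are arbitrary; timeit.repeat returns the total
# seconds taken by each repeat, so divide by number to get a per-call time.
number = 20
totals = timeit.repeat(lambda: preprocess(A, B, C), repeat=3, number=number)
best_ms = min(totals) / number * 1e3
print("best of 3: %.2f ms per loop" % best_ms)
```

Running this as a plain script gives a number that can be compared with the embedded run without IPython in the loop.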
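Here is the complete C++ code I use to benchmark the same function: profile.cpp. It uses the Pybind11 library for embedding. Pybind11 is header-only, so you can clone the repository and include the headers to get it working.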
#include <pybind11/embed.h>
#include <pybind11/numpy.h>
#include <iostream>
#include <numpy/arrayobject.h>
#include <chrono>
#include <numeric>  // std::accumulate
#include <vector>
namespace py = pybind11;

// Convert C++ Vector to py::array_t
template<typename T>
py::array_t<T> cppvector2pyarray(std::vector<T>& vec, const std::vector<int>& shape) {
    // Row-major strides in bytes: the last axis is contiguous
    std::vector<ssize_t> stride(shape.size(), sizeof(T));
    for(int i=shape.size()-2; i >= 0; --i) {
        stride[i] = stride[i+1] * shape[i+1];
    }
    return py::array_t<T>(shape, stride, vec.data());
}

// Convert py::array_t to C++ vector
template<typename T, typename InputType>
std::vector<T> pyarray2cppvector(InputType& pyarr) {
    PyObject* obj = pyarr.ptr();
    PyArrayObject* numpyobj = (PyArrayObject*) obj;
    int nd = numpyobj->nd;
    // The element count is the product of the dimensions, not their sum
    int size = 1;
    for(int i=0; i<nd; i++) { size *= numpyobj->dimensions[i]; }
    std::vector<T> cpparr((T*)numpyobj->data, (T*)numpyobj->data+size);
    return cpparr;
}

// A simple timer
template<typename D>
class Timer {
public:
    using duration = std::chrono::duration<double, D>;
    std::chrono::time_point<std::chrono::high_resolution_clock> _start, _end;
    std::vector<duration> _diffs{0};
    void start() {_start = std::chrono::high_resolution_clock::now(); }
    void stop() {_end = std::chrono::high_resolution_clock::now(); _diffs.push_back(_end - _start); }
    ssize_t niters() { return _diffs.size(); }
    double total() { return std::accumulate(_diffs.begin(), _diffs.end(), duration()).count(); }
};

Timer<std::milli> timer;
#define H 1024

int main() {
    py::scoped_interpreter guard{};

    // Load the Python module
    py::module calc = py::module::import("calc");
    py::object preprocess = calc.attr("preprocess");

    // Prepare the inputs
    std::vector<float> A(1*3*H*H, 100.0);
    std::vector<float> B(1*3*1*1, 50.0);
    std::vector<float> C(1*3*1*1, 0.5);
    auto pyA = cppvector2pyarray<float>(A, {1, 3, H, H});
    auto pyB = cppvector2pyarray<float>(B, {1, 3, 1, 1});
    auto pyC = cppvector2pyarray<float>(C, {1, 3, 1, 1});

    // warm up run
    py::object res4 = preprocess(pyA, pyB, pyC);

    // Benchmark: report the mean time per call in milliseconds
    for(int i=0; i<1e3; ++i) {
        timer.start();
        py::object res4 = preprocess(pyA, pyB, pyC);
        timer.stop();
    }
    std::cout << timer.total() / timer.niters() << std::endl;
}
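The two conversion helpers rely on simple shape arithmetic: cppvector2pyarray builds row-major (C-order) byte strides, and pyarray2cppvector needs the total element count, which is the product of the dimensions. As a rough cross-check, the same arithmetic can be sketched in plain Python. The function names num_elements and row_major_strides are hypothetical and are not part of Pybind11 or NumPy.

```python
from functools import reduce
from operator import mul


def num_elements(shape):
    # Total element count is the product of the dimensions (1 for an empty shape).
    return reduce(mul, shape, 1)


def row_major_strides(shape, itemsize):
    # The last axis is contiguous; each earlier axis steps over
    # everything that lies to its right.
    strides = [itemsize] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides


shape = (1, 3, 1024, 1024)
print(num_elements(shape))          # 3145728
print(row_major_strides(shape, 4))  # [12582912, 4194304, 4096, 4]
```

For the float32 input above (itemsize 4), these values should match what A.size and A.strides report on the NumPy side.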
Compile it as follows:
g++ -O3 -Wall -std=c++14 -march=native -I pybind11/include -I /usr/lib/python2.7/dist-packages/numpy/core/include profile.cpp -o profile `python-config --libs --includes` -lpthread
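When I benchmark the C++ code, the call alone takes about 9.5 ms (not counting the data conversion functions), which is about 3x slower than the IPython benchmark.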
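One thing to keep in mind when comparing the two figures is that %timeit reports the best of its repeats, while the C++ Timer reports the mean over all iterations. That difference is unlikely to account for a 3x gap on its own, but measuring both statistics with the same harness makes the comparison cleaner. The following is a minimal sketch assuming Python 3 (time.perf_counter is not available in 2.7). The bench helper is hypothetical, and the workload is a placeholder rather than the real preprocess call.

```python
import time


def bench(fn, iterations=1000, warmup=1):
    # Run a few untimed warm-up calls, then time each call individually.
    for _ in range(warmup):
        fn()
    samples_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1e3)
    return {
        "mean_ms": sum(samples_ms) / len(samples_ms),
        "best_ms": min(samples_ms),
    }


# Placeholder workload; substitute lambda: preprocess(A, B, C) for the real test.
stats = bench(lambda: sum(range(10000)), iterations=100)
print(stats)
```

If the mean and the best differ a lot, the average is being pulled up by occasional slow iterations rather than by a uniformly slower call.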
Python : v2.7
OS : Ubuntu 16.04
Update (June 25, 2019)
I tried the same benchmark with Python 3.6, but the results are still the same.