Real-time image classification in Python with a neural network

Date: 2016-09-16 01:46:46

标签: python multiprocessing deep-learning caffe gpgpu

I am trying to do real-time image classification with Caffe and Python. I use OpenCV to stream from my webcam in one process, and in a separate process I use Caffe to run image classification on the frames pulled from the webcam. I then pass the classification result back to the main thread to caption the webcam stream.

The problem is that even though I have an NVIDIA GPU and run the Caffe prediction on the GPU, the main thread slows down. Normally, without any predictions running, my webcam stream runs at 30 fps; with predictions, it gets at most 15 fps.

I have verified that Caffe does use the GPU when running predictions, and that neither my GPU nor its memory is being maxed out. I have also verified that my CPU cores are never maxed out at any point in the program. I would like to know whether I am doing something wrong, or whether there is no way to keep these two processes truly separate. Any advice is appreciated. Here is my code for reference:

import cv2
import caffe
import multiprocessing

class Consumer(multiprocessing.Process):

    def __init__(self, task_queue, result_queue):
        multiprocessing.Process.__init__(self)
        self.task_queue = task_queue
        self.result_queue = result_queue
        #other initialization stuff

    def run(self):
        caffe.set_mode_gpu()
        caffe.set_device(0)
        #Load caffe net -- code omitted
        while True:
            image = self.task_queue.get()
            #crop image -- code omitted
            text = net.predict(image)
            self.result_queue.put(text)

        return

tasks = multiprocessing.Queue()
results = multiprocessing.Queue()
consumer = Consumer(tasks, results)
consumer.start()

#Creating window and starting video capturer from camera
cv2.namedWindow("preview")
vc = cv2.VideoCapture(0)
#Try to get the first frame
if vc.isOpened():
    rval, frame = vc.read()
else:
    rval = False
frame_copy = frame.copy() if rval else None
task_empty = True
while rval:
    if task_empty:
       tasks.put(frame_copy)
       task_empty = False
    if not results.empty():
       text = results.get()
       #Add text to frame
       cv2.putText(frame, text, (5, 20), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0))
       task_empty = True

    #Showing the frame with all the applied modifications
    cv2.imshow("preview", frame)

    #Getting next frame from camera
    rval, frame = vc.read()
    frame_copy[:] = frame
    #Getting keyboard input 
    key = cv2.waitKey(1)
    #exit on ESC
    if key == 27:
        break

I am fairly sure it is the Caffe prediction slowing everything down, because when I comment out the prediction and pass dummy text back and forth between the processes, I get 30 fps again.

import cv2
import caffe
import multiprocessing

class Consumer(multiprocessing.Process):

    def __init__(self, task_queue, result_queue):
        multiprocessing.Process.__init__(self)
        self.task_queue = task_queue
        self.result_queue = result_queue
        #other initialization stuff

    def run(self):
        caffe.set_mode_gpu()
        caffe.set_device(0)
        #Load caffe net -- code omitted
        while True:
            image = self.task_queue.get()
            #crop image -- code omitted
            #text = net.predict(image)
            text = "dummy text"
            self.result_queue.put(text)

        return

tasks = multiprocessing.Queue()
results = multiprocessing.Queue()
consumer = Consumer(tasks, results)
consumer.start()

#Creating window and starting video capturer from camera
cv2.namedWindow("preview")
vc = cv2.VideoCapture(0)
#Try to get the first frame
if vc.isOpened():
    rval, frame = vc.read()
else:
    rval = False
frame_copy = frame.copy() if rval else None
task_empty = True
while rval:
    if task_empty:
       tasks.put(frame_copy)
       task_empty = False
    if not results.empty():
       text = results.get()
       #Add text to frame
       cv2.putText(frame, text, (5, 20), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0))
       task_empty = True

    #Showing the frame with all the applied modifications
    cv2.imshow("preview", frame)

    #Getting next frame from camera
    rval, frame = vc.read()
    frame_copy[:] = frame
    #Getting keyboard input 
    key = cv2.waitKey(1)
    #exit on ESC
    if key == 27:
        break

4 Answers:

Answer 0 (score: 4)

A few findings and some thoughts:

  1. I ran my code on a laptop with an Intel Core i5-6300HQ @2.3GHz CPU, 8 GB RAM and an NVIDIA GeForce GTX 960M GPU (2 GB memory), and the result was:

    Whether or not I ran the code with Caffe running (by commenting out net_output = this->net_->Forward(net_input) and some related parts of void Consumer::entry()), I could always get around 30 fps in the main thread.

    A similar result was obtained on a PC with an Intel Core i5-4440 CPU, 8 GB RAM and an NVIDIA GeForce GT 630 GPU (1 GB memory).

  2. I ran the code from the question by @user3543300 on the same laptop, and the result was:

    Whether or not Caffe was running (on the GPU), I could also get around 30 fps.

  3. According to @user3543300's feedback, with the 2 versions of code mentioned above, @user3543300 can only get around 15 fps when running Caffe (on a laptop with an NVIDIA GeForce 940MX GPU and an Intel® Core™ i7-6500U CPU @ 2.50GHz × 4). The webcam frame rate also drops when Caffe runs on the GPU as a standalone program.

  4. So I still think the problem may most likely lie in hardware I/O limits such as DMA bandwidth (this thread about DMA may hint at it) or RAM bandwidth. I hope @user3543300 can check this, or find the real problem that I have not realized yet.

    If the problem is indeed what I mentioned above, a sensible idea would be to reduce the memory I/O overhead introduced by the CNN. In fact, to solve similar problems on embedded systems with limited hardware resources, there has been some research on this topic, e.g. Quantization, Structurally Sparse Deep Neural Networks, SqueezeNet and Deep Compression. So hopefully applying such techniques would also help improve the webcam frame rate.
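To illustrate the kind of saving those techniques target, here is a minimal, hypothetical NumPy sketch of linear 8-bit weight quantization (this is not Caffe's API, and the layer shape is invented): storing weights as uint8 instead of float32 cuts the bytes that must move through memory per forward pass by 4x, at the cost of a bounded rounding error.

```python
import numpy as np

# Hypothetical float32 weight tensor of a conv layer (shape is invented)
weights = np.random.randn(64, 3, 7, 7).astype(np.float32)

# Linear (affine) quantization to uint8: store one scale/offset per tensor
w_min, w_max = weights.min(), weights.max()
scale = (w_max - w_min) / 255.0
q_weights = np.round((weights - w_min) / scale).astype(np.uint8)

# Dequantize on the fly when computing; memory traffic is 4x smaller
deq = q_weights.astype(np.float32) * scale + w_min

print(weights.nbytes // q_weights.nbytes)   # float32 vs uint8 storage ratio
print(float(np.abs(deq - weights).max()))   # worst-case rounding error
```

The rounding error is at most half a quantization step (scale / 2), which is usually tolerable for classification accuracy.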

    Original answer:

    Try this C++ solution. It uses a thread for the I/O overhead in your task. I tested it with bvlc_alexnet.caffemodel and deploy.prototxt for image classification, and did not see obvious slowdown of the main thread (the webcam stream) while Caffe was running (on the GPU):

    #include <stdio.h>
    #include <iostream>
    #include <fstream>
    #include <sstream>
    #include <string>
    #include <boost/thread.hpp>
    #include <boost/shared_ptr.hpp>
    #include "caffe/caffe.hpp"
    #include "caffe/util/blocking_queue.hpp"
    #include "caffe/data_transformer.hpp"
    #include "opencv2/opencv.hpp"
    
    using namespace cv;
    
    //Queue pair for sharing image/results between webcam and caffe threads
    template<typename T>
    class QueuePair {
      public:
        explicit QueuePair(int size);
        ~QueuePair();
    
        caffe::BlockingQueue<T*> free_;
        caffe::BlockingQueue<T*> full_;
    
      DISABLE_COPY_AND_ASSIGN(QueuePair);
    };
    template<typename T>
    QueuePair<T>::QueuePair(int size) {
      // Initialize the free queue
      for (int i = 0; i < size; ++i) {
        free_.push(new T);
      }
    }
    template<typename T>
    QueuePair<T>::~QueuePair(){
      T *data;
      while (free_.try_pop(&data)){
        delete data;
      }
      while (full_.try_pop(&data)){
        delete data;
      }
    }
    template class QueuePair<Mat>;
    template class QueuePair<std::string>;
    
    //Do image classification(caffe predict) using a subthread
    class Consumer{
      public:
        Consumer(boost::shared_ptr<QueuePair<Mat>> task
               , boost::shared_ptr<QueuePair<std::string>> result);
        ~Consumer();
        void Run();
        void Stop();
        void entry(boost::shared_ptr<QueuePair<Mat>> task
                 , boost::shared_ptr<QueuePair<std::string>> result);
    
      private:
        bool must_stop();
    
        boost::shared_ptr<QueuePair<Mat> > task_q_;
        boost::shared_ptr<QueuePair<std::string> > result_q_;
    
        //caffe::Blob<float> *net_input_blob_;
        boost::shared_ptr<caffe::DataTransformer<float> > data_transformer_;
        boost::shared_ptr<caffe::Net<float> > net_;
        std::vector<std::string> synset_words_;
        boost::shared_ptr<boost::thread> thread_;
    };
    Consumer::Consumer(boost::shared_ptr<QueuePair<Mat>> task
                     , boost::shared_ptr<QueuePair<std::string>> result) :
     task_q_(task), result_q_(result), thread_(){
    
      //for data preprocess
      caffe::TransformationParameter trans_para;
      //set mean
      trans_para.set_mean_file("/path/to/imagenet_mean.binaryproto");
      //set crop size, here is cropping 227x227 from 256x256
      trans_para.set_crop_size(227);
      //instantiate a DataTransformer using trans_para for image preprocess
      data_transformer_.reset(new caffe::DataTransformer<float>(trans_para
                            , caffe::TEST));
    
      //initialize a caffe net
      net_.reset(new caffe::Net<float>(std::string("/path/to/deploy.prototxt")
               , caffe::TEST));
      //net parameter
      net_->CopyTrainedLayersFrom(std::string("/path/to/bvlc_alexnet.caffemodel"));
    
      std::fstream synset_word("path/to/caffe/data/ilsvrc12/synset_words.txt");
      std::string line;
      if (!synset_word.good()){
        std::cerr << "synset words open failed!" << std::endl;
      }
      while (std::getline(synset_word, line)){
        synset_words_.push_back(line.substr(line.find_first_of(' '), line.length()));
      }
      //a container for net input, holds data converted from cv::Mat
      //net_input_blob_ = new caffe::Blob<float>(1, 3, 227, 227);
    }
    Consumer::~Consumer(){
      Stop();
      //delete net_input_blob_;
    }
    void Consumer::entry(boost::shared_ptr<QueuePair<Mat>> task
        , boost::shared_ptr<QueuePair<std::string>> result){
    
      caffe::Caffe::set_mode(caffe::Caffe::GPU);
      caffe::Caffe::SetDevice(0);
    
      cv::Mat *frame;
      cv::Mat resized_image(256, 256, CV_8UC3);
      cv::Size re_size(resized_image.cols, resized_image.rows);
    
      //for caffe input and output
      const std::vector<caffe::Blob<float> *> net_input = this->net_->input_blobs();
      std::vector<caffe::Blob<float> *> net_output;
    
      //net_input.push_back(net_input_blob_);
      std::string *res;
    
      int pre_num = 1;
      while (!must_stop()){
        std::stringstream result_strm;
        frame = task->full_.pop();
        cv::resize(*frame, resized_image, re_size, 0, 0, CV_INTER_LINEAR);
        this->data_transformer_->Transform(resized_image, *net_input[0]);
        net_output = this->net_->Forward();
        task->free_.push(frame);
    
        res = result->free_.pop();
        //Process results here
        for (int i = 0; i < pre_num; ++i){
          result_strm << synset_words_[net_output[0]->cpu_data()[i]] << " " 
                      << net_output[0]->cpu_data()[i + pre_num] << "\n";
        }
        *res = result_strm.str();
        result->full_.push(res);
      }
    }
    
    void Consumer::Run(){
      if (!thread_){
        try{
          thread_.reset(new boost::thread(&Consumer::entry, this, task_q_, result_q_));
        }
        catch (std::exception& e) {
          std::cerr << "Thread exception: " << e.what() << std::endl;
        }
      }
      else
        std::cout << "Consumer thread may have been running!" << std::endl;
    };
    void Consumer::Stop(){
      if (thread_ && thread_->joinable()){
        thread_->interrupt();
        try {
          thread_->join();
        }
        catch (boost::thread_interrupted&) {
        }
        catch (std::exception& e) {
          std::cerr << "Thread exception: " << e.what() << std::endl;
        }
      }
    }
    bool Consumer::must_stop(){
      return thread_ && thread_->interruption_requested();
    }
    
    
    int main(void)
    {
      int max_queue_size = 1000;
      boost::shared_ptr<QueuePair<Mat>> tasks(new QueuePair<Mat>(max_queue_size));
      boost::shared_ptr<QueuePair<std::string>> results(new QueuePair<std::string>(max_queue_size));
    
      char str[100], info_str[100] = " results: ";
      VideoCapture vc(0);
      if (!vc.isOpened())
        return -1;
    
      Consumer consumer(tasks, results);
      consumer.Run();
    
      Mat frame, *frame_copy;
      namedWindow("preview");
      double t, fps;
    
      while (true){
        t = (double)getTickCount();
        vc.read(frame);
    
        if (waitKey(1) >= 0){
          consumer.Stop();
          break;
        }
    
        if (tasks->free_.try_peek(&frame_copy)){
          frame_copy = tasks->free_.pop();
          *frame_copy = frame.clone();
          tasks->full_.push(frame_copy);
        }
        std::string *res;
        std::string frame_info("");
        if (results->full_.try_peek(&res)){
          res = results->full_.pop();
          frame_info = frame_info + info_str;
          frame_info = frame_info + *res;
          results->free_.push(res);
        }    
    
        t = ((double)getTickCount() - t) / getTickFrequency();
        fps = 1.0 / t;
    
        sprintf(str, " fps: %.2f", fps);
        frame_info = frame_info + str;
    
        putText(frame, frame_info, Point(5, 20)
             , FONT_HERSHEY_SIMPLEX, 0.5, Scalar(0, 255, 0));
        imshow("preview", frame);
      }
    }
    

    In src/caffe/util/blocking_queue.cpp, make a small change as below and rebuild Caffe:

    ...//Other stuff
    template class BlockingQueue<Batch<float>*>;
    template class BlockingQueue<Batch<double>*>;
    template class BlockingQueue<Datum*>;
    template class BlockingQueue<shared_ptr<DataReader::QueuePair> >;
    template class BlockingQueue<P2PSync<float>*>;
    template class BlockingQueue<P2PSync<double>*>;
    //add these 2 lines below
    template class BlockingQueue<cv::Mat*>;
    template class BlockingQueue<std::string*>;
    

Answer 1 (score: 2)

It seems that Caffe's Python wrapper holds the Global Interpreter Lock (GIL). Therefore, calling any Caffe Python command blocks all Python threads.
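A minimal, Caffe-free sketch of this effect (the function names are made up for illustration): a pure-Python worker that holds the GIL inflates the frame interval of a simulated camera thread, while a blocking call that releases the GIL does not.

```python
import sys
import threading
import time

def camera_loop(gaps, stop):
    """Simulated webcam loop: record the time between 'frames'."""
    last = time.perf_counter()
    while not stop.is_set():
        time.sleep(0.01)              # ~100 fps target; sleep releases the GIL
        now = time.perf_counter()
        gaps.append(now - last)
        last = now

def mean_frame_gap(worker):
    """Run worker in the main thread and measure the camera thread's frame gap."""
    gaps, stop = [], threading.Event()
    t = threading.Thread(target=camera_loop, args=(gaps, stop))
    t.start()
    worker()
    stop.set()
    t.join()
    return sum(gaps) / len(gaps)

# Lengthen the GIL switch interval so the stall is clearly visible
sys.setswitchinterval(0.1)

def gil_hog():                        # pure-Python loop: never releases the GIL
    x = 0
    for _ in range(10 ** 7):
        x += 1

def gil_friendly():                   # blocking call: releases the GIL
    time.sleep(0.5)

hog_gap = mean_frame_gap(gil_hog)
friendly_gap = mean_frame_gap(gil_friendly)
print(hog_gap > friendly_gap)
```

Here time.sleep stands in for a C call wrapped in Py_BEGIN_ALLOW_THREADS / Py_END_ALLOW_THREADS, and the busy loop stands in for a wrapper that never releases the lock.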

A workaround (at your own risk) would be to disable the GIL for specific Caffe functions. For example, if you want to be able to run forward without the lock, you can edit $CAFFE_ROOT/python/caffe/_caffe.cpp and add this function:

void Net_Forward(Net<Dtype>& net, int start, int end) {
  Py_BEGIN_ALLOW_THREADS;   // <-- disable GIL
  net.ForwardFromTo(start, end);
  Py_END_ALLOW_THREADS;     // <-- restore GIL
}

Then replace .def("_forward", &Net<Dtype>::ForwardFromTo) with:

.def("_forward", &Net_Forward)

Don't forget to make pycaffe after this change.

See this for more details.

Answer 2 (score: 0)

One thing that may be happening in your code: it works in GPU mode for the first call, but on later calls it computes the classification in CPU mode, since that is the default mode. On older versions of Caffe, setting GPU mode once was enough; newer versions need the mode set every time. You can try the following change:

def run(self):
    #Load caffe net -- code omitted
    while True:
        caffe.set_mode_gpu()
        caffe.set_device(0)
        image = self.task_queue.get()
        #crop image -- code omitted
        text = net.predict(image)
        self.result_queue.put(text)

    return

Also, have a look at the GPU timings while the consumer thread is running. For NVIDIA you can use the following command:

nvidia-smi

The command above will show the GPU utilization at runtime.

If that doesn't solve it, another option: move the OpenCV frame-extraction code into its own thread. Since it involves I/O and device access, it can benefit from running in a thread separate from the GUI/main thread. That thread would push frames into a queue, and the current consumer thread would predict. In that case, be careful to guard the queue with critical sections.
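A minimal sketch of that threaded-grabber pattern (the names grabber and read_frame are made up, and a dummy frame source stands in for cv2.VideoCapture, so the structure is the point, not the camera API): the grabber keeps only the newest frame in a size-1 queue, dropping stale frames if the consumer falls behind.

```python
import queue
import threading

def grabber(frames, read_frame):
    """I/O thread: keep only the most recent frame in a size-1 queue."""
    while True:
        frame = read_frame()              # stand-in for vc.read()
        if frame is None:                 # source exhausted / camera closed
            break
        try:
            frames.put_nowait(frame)
        except queue.Full:                # consumer is behind: drop old frame
            try:
                frames.get_nowait()
            except queue.Empty:
                pass
            frames.put_nowait(frame)

# Dummy frame source standing in for cv2.VideoCapture(0).read()
frame_source = iter(range(100))
def read_frame():
    return next(frame_source, None)

frames = queue.Queue(maxsize=1)
t = threading.Thread(target=grabber, args=(frames, read_frame))
t.start()

# Main/GUI loop: consume whatever the grabber managed to deliver
seen = []
while True:
    try:
        seen.append(frames.get(timeout=0.5))
    except queue.Empty:
        break
t.join()
print(seen[-1])
```

The consumer may skip frames under load, but it always sees them in order and always receives the newest one, which is what a live caption overlay needs.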

Answer 3 (score: 0)

Try a multithreading approach instead of multiprocessing. Spawning processes is slower than spawning threads; once they are running, there is not much difference. In your case I think a threading approach would benefit, since there is so much frame data involved.
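A minimal sketch of the suggestion, with the question's Consumer rewritten on top of threading and queue (a dummy string stands in for net.predict, which is the only Caffe-specific part):

```python
import queue
import threading

class Consumer(threading.Thread):
    """Same structure as the multiprocessing version, but as a thread:
    no process-spawn cost and no pickling of frames through the queue."""

    def __init__(self, task_queue, result_queue):
        threading.Thread.__init__(self)
        self.task_queue = task_queue
        self.result_queue = result_queue

    def run(self):
        while True:
            image = self.task_queue.get()
            if image is None:               # sentinel: shut down cleanly
                break
            text = "label for %s" % image   # stand-in for net.predict(image)
            self.result_queue.put(text)

tasks, results = queue.Queue(), queue.Queue()
consumer = Consumer(tasks, results)
consumer.start()

tasks.put("frame0")
out = results.get(timeout=5)
tasks.put(None)                             # ask the consumer to exit
consumer.join()
print(out)
```

Note that queue.Queue hands the very same ndarray object to the consumer, whereas multiprocessing.Queue pickles and copies each frame; whether threading actually helps here depends on whether the Caffe call releases the GIL, as discussed in Answer 1.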