Parsing large amounts of text

Time: 2011-01-13 23:08:01

Tags: c++ string sockets memory-leaks

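A quick overview of the program:
1.) Open a connection to a socket and read the data.
2.) Split the data at the newline char.
3.) Push the data segments onto a queue so they can be processed in a separate thread.

I am using the curlpp library because it handles authentication and DNS lookup. The queue is just a deque with a mutex for thread safety.

This is the approach I'm using right now.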

std::string input;
size_t socketIO::dataCallBack(char* ptr, size_t size, size_t nmemb) {
    // Calculate the real size of the incoming buffer
    size_t realsize = size * nmemb;

    //Append the new input to the old input
    input.append(ptr, realsize);

    //Find all the complete strings and push them to the queue
    size_t oldPosition = 0;
    size_t position = 0;
    position = input.find('\r', oldPosition);
    while (position != std::string::npos) {
        // substr takes (start, length), so pass the segment length
        queueObject.push(input.substr(oldPosition, position - oldPosition));
        oldPosition = position + 1;
        position = input.find('\r', oldPosition);
    }

    //Save off the partial string as you'll get the rest of it on the next data callback
    input = input.substr(oldPosition);

    return realsize;
}

I have a few concerns. I am running into memory leak problems, and valgrind reports a major leak from this function.

==12867== 813,287,102 bytes in 390,337 blocks are possibly lost in loss record 359 of 359
==12867==    at 0x4C27CC1: operator new(unsigned long) (vg_replace_malloc.c:261)
==12867==    by 0x5AA8D98: std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&) (in /usr/lib/libstdc++.so.6.0.13)
==12867==    by 0x5AA9B64: ??? (in /usr/lib/libstdc++.so.6.0.13)
==12867==    by 0x5AA9D38: std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&, unsigned long, unsigned long) (in /usr/lib/libstdc++.so.6.0.13)
==12867==    by 0x41E4F5: socketIO::write(char*, unsigned long, unsigned long) (basic_string.h:2006)
==12867==    by 0x509C657: utilspp::Functor<unsigned long, utilspp::tl::TypeList<char*, utilspp::tl::TypeList<unsigned long, utilspp::tl::TypeList<unsigned long, utilspp::NullType> > > >::operator()(char*, unsigned long, unsigned long) (Functor.hpp:106)
==12867==    by 0x509B6E4: curlpp::internal::CurlHandle::executeWriteFunctor(char*, unsigned long, unsigned long) (CurlHandle.cpp:171)
==12867==    by 0x509F509: curlpp::internal::Callbacks::WriteCallback(char*, unsigned long, unsigned long, curlpp::internal::CurlHandle*) (OptionSetter.cpp:47)
==12867==    by 0x4E3D667: ??? (in /usr/lib/libcurl-gnutls.so.4.1.1)
==12867==    by 0x4E5407B: ??? (in /usr/lib/libcurl-gnutls.so.4.1.1)
==12867==    by 0x4E505A1: ??? (in /usr/lib/libcurl-gnutls.so.4.1.1)
==12867==    by 0x4E51A8F: ??? (in /usr/lib/libcurl-gnutls.so.4.1.1)
==12867==    by 0x509A78B: curlpp::internal::CurlHandle::perform() (CurlHandle.cpp:52)
==12867==    by 0x5093A6B: curlpp::Easy::perform() (Easy.cpp:48)
==12867==    by 0x41EDC3: socketIO::processLoop() (socketIO.cpp:126)

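What would you recommend? I considered using an istringstream, but I'm not sure how its memory allocation works or whether it reclaims the memory I have already read. My problem is that I need to keep data between callbacks, but in a way that doesn't leak memory.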
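One way to keep the unfinished tail between callbacks is to scan the buffer once and then erase the consumed prefix in place. The following is a minimal sketch under assumptions, not the poster's code: `extractLines` is a hypothetical helper, the `'\r'` delimiter is taken from the code above, and a plain `std::vector` stands in for the queue. Note that `std::string::substr` takes a start position and a length, so each segment is `position - start` characters long.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: copies every complete delimiter-terminated segment of
// `buffer` into `out` and leaves only the unfinished tail in `buffer`.
// Returns the number of segments extracted.
std::size_t extractLines(std::string& buffer, std::vector<std::string>& out,
                         char delimiter = '\r') {
    std::size_t start = 0;
    std::size_t count = 0;
    std::size_t position = buffer.find(delimiter, start);
    while (position != std::string::npos) {
        // substr(pos, len): the second argument is a length, not an end index.
        out.push_back(buffer.substr(start, position - start));
        ++count;
        start = position + 1;
        position = buffer.find(delimiter, start);
    }
    // Drop the consumed prefix in place; the tail waits for the next callback.
    buffer.erase(0, start);
    return count;
}
```

In a write callback, one would append the incoming bytes to the member buffer, call the helper, and push each extracted segment onto the queue.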

UPDATE: More code, as requested. I've posted more rather than less to give a better idea.

main.cpp

/**
 * The main driver for the twitter capture app.  Starts multiple threads for processors, 1 io thread and 2 db threads. One for user
 * information and the other for tweet information 
 */

#include "types.h"
#include "threadBase.h"
#include "socketIO.h"
#include "processor.h"
#include "dbTweetQueue.h"
#include "dbUserQueue.h"

#include <signal.h>
#include <exception>
#include <iostream>
#include <vector>


stringQueue twitToProc;
tweetQueue tweetQ;
userQueue userQ;
deleteQueue deleteQ;
std::vector<ThreadBase *> threadGroup;

std::string dbBase::dbUser(DBUSER);
std::string dbBase::dbURL(DBURL);
std::string dbBase::dbPass(DBPASS);

/*
 * Handle the signal for interrupt
 */
void sigquit(int param)
{
    std::cout<<"Received sigquit"<<std::endl;
    for(unsigned int i = 0; i < threadGroup.size(); i++)
    {
        threadGroup[i]->interupt();
    }
}


int main(int argc, char* argv[])
{
    try{
    //Setting the signal handler up.
    struct sigaction act;
    act.sa_handler = sigquit;
    sigemptyset(&act.sa_mask);
    act.sa_flags = 0;
    sigaction(SIGQUIT, &act, 0);


    int MaxThreads = 5;
    if(argc < 3)
    {
        std::cout<<"Usage: >"<<argv[0]<<" TwitterUserName TwitterPassWord"<<std::endl;
        std::cout<<"Using Defaults: "<<TWITTERACCT<<" "<<TWITTERPASS<<std::endl;
    }

    // Create socketIO, and add it to the thread group
    if(argc == 3)
    {
        threadGroup.push_back(new socketIO(twitToProc, argv[1], argv[2]));
    }
    else
    {
        threadGroup.push_back(new socketIO(twitToProc));
    }


   // Create processorThreads and add them to the thread group
    for(int i = 0; i < MaxThreads; i++)
    {
        threadGroup.push_back(new processor(twitToProc, tweetQ, deleteQ, userQ));
    }

    //Create DB Threads and add them to the thread group.
    threadGroup.push_back(new dbTweetQueue(tweetQ, deleteQ));
    threadGroup.push_back(new dbUserQueue(userQ));


    // Start the threads
    for(unsigned int i = 0; i < threadGroup.size(); i++)
    {
        threadGroup[i]->start();
    }

    // Join the threads
    for(unsigned int i = 0; i < threadGroup.size(); i++)
    {
        threadGroup[i]->join();
    }

    } catch (std::exception & e) {
        std::cerr << e.what() << std::endl;
    }

    for(unsigned int i = 0; i < threadGroup.size(); i++)
    {
        delete threadGroup[i];
    }
    return 0;
}

threadBase.h

#ifndef _THREADBASE_H
#define _THREADBASE_H

#include <boost/thread.hpp>

class ThreadBase
{
public:
    virtual void join() = 0;
    virtual void start() = 0;
    void interupt(){thread.interrupt();}
protected:
    boost::thread thread;

};



#endif  /* _THREADBASE_H */

socketIO.h

#ifndef _SOCKETIO_H
#define _SOCKETIO_H

#include "types.h"
#include "threadBase.h"

#include <boost/bind.hpp>
#include <curlpp/cURLpp.hpp>
#include <curlpp/Multi.hpp>
#include <curlpp/Easy.hpp>
#include <curlpp/Options.hpp>
#include <curlpp/Exception.hpp>
#include <curlpp/Infos.hpp>
#include <curl/curl.h>

#include <signal.h>
#include <memory>
#include <string>
#include <sstream>
#include <cstdlib>


#define defaultRepeatInterval 10

class socketIO: public ThreadBase {
private:
    int repeatInterval;
    double previousDownloadSize;
    int failCount;
    int writeRound;
    std::string userPassword;
    stringQueue&  queueObject;
    std::string input;


public:
    socketIO(stringQueue & messageQueue):
                queueObject(messageQueue)
    {
        userPassword.append(TWITTERACCT);
        userPassword.append(":");
        userPassword.append(TWITTERPASS);
    }

    socketIO(stringQueue & messageQueue, char* userName, char* password):
                queueObject(messageQueue)
    {
        userPassword.append(userName);
        userPassword.append(":");
        userPassword.append(password);
    }

    virtual ~socketIO();

    void join();
    void start();
    std::auto_ptr<curlpp::Easy> createRequest(int);



    void processLoop();
    size_t write(char* ptr, size_t size, size_t nmemb);
    int progress(double, double, double, double);

};

#endif  /* _SOCKETIO_H */

socketIO.cpp

#include "socketIO.h"

socketIO::~socketIO() {
}

/*
 * This method starts a new thread with the processLoop method
 */
void socketIO::start() {
    thread = boost::thread(&socketIO::processLoop, this);
}

/*
 * This method blocks waiting for the thread to exit
 */
void socketIO::join() {
    thread.join();
}

/*
 * The data callback function for the open twitter connection.
 */
size_t socketIO::write(char* ptr, size_t size, size_t nmemb) {
    // Calculate the real size of the incoming buffer
    size_t realsize = size * nmemb;
    std::string temp;
    temp.append(input);
    temp.append(ptr, realsize);
    size_t oldPosition = 0;
    size_t position = 0;
    position = temp.find('\r', oldPosition);
    while (position != std::string::npos) {
        // substr takes (start, length), so pass the segment length
        queueObject.push(temp.substr(oldPosition, position - oldPosition));
        ++writeRound;
        oldPosition = position + 1;
        position = temp.find('\r', oldPosition);
    }
    input = temp.substr(oldPosition);
    return realsize;
}

/*
 * The timed callback function, called every second, used to monitor that the connection is still receiving data
 * Return 1 if requesting break or data flow stops, 0 if continuing normally
 */
int socketIO::progress(double dltotal, double dlnow, double ultotal, double ulnow) {
    // Allows us to break out on interruption
    if (boost::this_thread::interruption_requested())
        return 1;

    if (dlnow == previousDownloadSize) {
        if (failCount < 15)
            failCount++;
        else {
            repeatInterval = repeatInterval * 2;
            return 1;
        }
    } else {
        repeatInterval = 10;
        previousDownloadSize = dlnow;
    }
    return 0;
}

/*
 * This method creates a new connection to the twitter service with the required settings
 */
std::auto_ptr<curlpp::Easy> socketIO::createRequest(int source) {
    //Reset the input buffer when the connection is made.
    input = std::string("");
    std::auto_ptr<curlpp::Easy> newRequest(new curlpp::Easy);

    curlpp::types::ProgressFunctionFunctor progressFunctor(this, &socketIO::progress);
    newRequest->setOpt(new curlpp::options::ProgressFunction(progressFunctor));

    curlpp::types::WriteFunctionFunctor functor(this, &socketIO::write);
    newRequest->setOpt(new curlpp::options::WriteFunction(functor));

    newRequest->setOpt(new curlpp::options::FailOnError(true));
    newRequest->setOpt(new curlpp::options::NoProgress(0));
    newRequest->setOpt(new curlpp::options::Verbose(true));
    newRequest->setOpt(new curlpp::options::UserPwd(userPassword));


    //Code for debugging and using alternate sources
    std::string params = "track=basketball,football,baseball,footy,soccer";

    switch (source) {
        case 1: // Testing Locally
            newRequest->setOpt(new curlpp::options::Url("127.0.0.1:17000"));
            break;
        case 2: // Filtered
            newRequest->setOpt(new curlpp::options::Url("http://stream.twitter.com/1/statuses/filter.json"));
            newRequest->setOpt(new curlpp::options::PostFields(params));
            newRequest->setOpt(new curlpp::options::PostFieldSize(params.size()));
            break;
        case 3: //Twitter Main Stream
            newRequest->setOpt(new curlpp::options::Url("http://stream.twitter.com/1/statuses/sample.json"));
            break;
    }

    return newRequest;
}


/*
 * The main method of the thread.  Creates a new instance of the request
 */
void socketIO::processLoop() {
    repeatInterval = defaultRepeatInterval;
    std::auto_ptr<curlpp::Easy> request;
    while (true) {
        try {
            previousDownloadSize = 0;
            failCount = 0;
            // createRequest returns an auto_ptr, so transfer ownership by assignment
            request = createRequest(3);
            request->perform();
        } catch (curlpp::UnknowException & e) {
            std::cout << "Unknown Exception: " << e.what() << std::endl;
        } catch (curlpp::RuntimeError & e) {
            std::cout << "Runtime Exception: " << e.what() << std::endl;
        } catch (curlpp::LogicError & e) {
            std::cout << "Logic Exception: " << e.what() << std::endl;
        }


        if (boost::this_thread::interruption_requested())
            break;
        else
            boost::this_thread::sleep(boost::posix_time::seconds(repeatInterval));
    }
}

types.h

#ifndef _TYPES_H
#define _TYPES_H

#include <string>
#include <utility>
#include <boost/cstdint.hpp>
#include "concurrent_queue.hpp"

#define DBUSER "****"
#define DBPASS "****"
#define DBURL "****"
#define TWITTERACCT "****"
#define TWITTERPASS "****"

typedef struct tweet {
...
} tweet;

typedef struct user {
...
} user;


typedef concurrent_queue<std::string> stringQueue;
typedef std::pair<int, std::string> dbPair;
typedef concurrent_queue<dbPair> dbQueue;

typedef concurrent_queue<tweet> tweetQueue;
typedef concurrent_queue<user> userQueue;
typedef concurrent_queue<boost::int64_t> deleteQueue;

#endif  /* _TYPES_H */

concurrent_queue.hpp

#ifndef _CONCURRENT_QUEUE_
#define _CONCURRENT_QUEUE_

#include <boost/thread/mutex.hpp>
#include <boost/thread/condition_variable.hpp>
#include <deque>

template<typename Data>
class concurrent_queue
{
private:
    std::deque<Data> the_queue;
    mutable boost::mutex the_mutex;
    boost::condition_variable the_condition_variable;
public:
    void push(Data const& data)
    {
        boost::mutex::scoped_lock lock(the_mutex);
        the_queue.push_back(data);
        lock.unlock();
        the_condition_variable.notify_one();
    }

    bool empty() const
    {
        boost::mutex::scoped_lock lock(the_mutex);
        return the_queue.empty();
    }

    bool try_pop(Data& popped_value)
    {
        boost::mutex::scoped_lock lock(the_mutex);
        if(the_queue.empty())
        {
            return false;
        }

        popped_value=the_queue.front();
        the_queue.pop_front();
        return true;
    }

    void wait_and_pop(Data& popped_value)
    {
        boost::mutex::scoped_lock lock(the_mutex);
        while(the_queue.empty())
        {
            the_condition_variable.wait(lock);
        }

        popped_value=the_queue.front();
        the_queue.pop_front();
    }

};

#endif  /* _CONCURRENT_QUEUE_ */

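4 Answers:

Answer 0 (score: 3)

Not that I really have enough information to be sure of this answer, but here is my guess.

Looking at the valgrind stack for where the memory was allocated, you see: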

==12867== 813,287,102 bytes in 390,337 blocks are possibly lost in loss record 359 of 359
==12867==    at 0x4C27CC1: operator new(unsigned long) (vg_replace_malloc.c:261)
==12867==    by 0x5AA8D98: std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&) (in /usr/lib/libstdc++.so.6.0.13)
==12867==    by 0x5AA9B64: ??? (in /usr/lib/libstdc++.so.6.0.13)
==12867==    by 0x5AA9D38: std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&, unsigned long, unsigned long) (in /usr/lib/libstdc++.so.6.0.13)
==12867==    by 0x41E4F5: socketIO::write(char*, unsigned long, unsigned long) (basic_string.h:2006)

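This pretty much means the string was created in the write method. std::string, like most STL containers, does not allocate anything on the heap until it has to, which in this case is when data is appended to it.

Now, the memory gets allocated, which is fine, but it is never freed because the destructor of std::string input is never called. There can be several reasons for that, but the most common ones are:

  • You heap-allocated the socketIO and forgot to free it.
  • You have virtual functions but forgot a virtual destructor somewhere.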
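Both causes can be ruled out structurally. The following is a sketch under assumptions: it uses C++11 `std::unique_ptr` rather than the raw `new` from the question, and `Worker`, `IoWorker`, and `runGroup` are hypothetical stand-ins for `ThreadBase`, `socketIO`, and the thread group in `main`. With a virtual destructor on the base class and an owning container, the `std::string` member is destroyed when the container goes out of scope.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for ThreadBase: the virtual destructor makes
// destruction through a base pointer run the derived destructor too.
class Worker {
public:
    virtual ~Worker() {}
    virtual void start() = 0;
};

// Hypothetical stand-in for socketIO: owns a heap-backed string buffer.
class IoWorker : public Worker {
public:
    explicit IoWorker(int& liveCount) : live(liveCount), input(4096, 'x') { ++live; }
    ~IoWorker() { --live; }  // input's destructor runs right after this body
    void start() {}
private:
    int& live;
    std::string input;
};

// Builds a group of workers, starts them, and returns how many are still
// alive after the owning vector has gone out of scope.
int runGroup(int workers) {
    int live = 0;
    {
        std::vector<std::unique_ptr<Worker> > group;
        for (int i = 0; i < workers; ++i)
            group.push_back(std::unique_ptr<Worker>(new IoWorker(live)));
        for (std::size_t i = 0; i < group.size(); ++i)
            group[i]->start();
    }  // every unique_ptr deletes its Worker here
    return live;
}
```

If the counter comes back as zero, every derived destructor ran, which is what the missing `delete` or missing virtual destructor would prevent.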

Answer 1 (score: 2)

ThreadBase does not have a virtual destructor.

The result of applying delete to a ThreadBase* is undefined whenever the object pointed to is not a ThreadBase but a derived type. In practice, it is usually a leak if any derived class allocates memory (directly or indirectly).

class ThreadBase
{
public:
    virtual ~ThreadBase() {} // <-- There you go!
    virtual void join() = 0;
    virtual void start() = 0;
    void interupt() { thread.interrupt(); }
protected:
    boost::thread thread;
};

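From a design point of view:

  • Avoid protected attributes; prefer providing methods that encapsulate their use.
  • The NVI (Non-Virtual Interface) idiom states that having methods that are both public and virtual is a bad idea (for example, you cannot check preconditions and postconditions). It is better to use public non-virtual methods that call private virtual methods for the implementation details.
  • You might inherit privately from boost::noncopyable to document that the class is not copyable.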
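The three points can be combined in one base class. This is a minimal sketch, assuming C++11 deleted copy operations in place of `boost::noncopyable`; `ThreadBaseNvi`, `doStart`, and `Recorder` are hypothetical names, and the actual thread is left out so the example stays self-contained.

```cpp
#include <string>
#include <vector>

// Hypothetical NVI-style base: public non-virtual entry point, private
// virtual hook, no protected data, and copying disabled.
class ThreadBaseNvi {
public:
    ThreadBaseNvi() : started(false) {}
    virtual ~ThreadBaseNvi() {}

    ThreadBaseNvi(const ThreadBaseNvi&) = delete;
    ThreadBaseNvi& operator=(const ThreadBaseNvi&) = delete;

    // Non-virtual wrapper: the precondition is checked in one place.
    bool start() {
        if (started) return false;  // refuse to start twice
        started = true;
        doStart();
        return true;
    }

private:
    virtual void doStart() = 0;  // implementation detail for derived classes
    bool started;
};

// Hypothetical derived class that records each call to its hook.
class Recorder : public ThreadBaseNvi {
public:
    explicit Recorder(std::vector<std::string>& target) : sink(target) {}
private:
    void doStart() { sink.push_back("started"); }
    std::vector<std::string>& sink;
};
```

Because `start()` is not virtual, a derived class cannot bypass the double-start check; it can only supply the private hook.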

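Answer 2 (score: 2)

I know I'm rather late here (more than a few months, in fact), but in case anyone else is following this thread: I had a similar problem, and I managed to trace it back to the curlpp library.

That being said, I am far from a C++ expert, and I'm fairly sure the cause is the way I'm using the library. They say they use RAII-style memory cleanup, but even when I explicitly create and destroy the option settings in my request (which is reused during my program's execution), I still see serious growth in the process memory footprint.

When I remove the calls to the curlpp library, my program's memory requirements stay essentially static. Most of the examples they provide are simple programs where main() does something and exits, so it is not that straightforward to create a daemon-like executable around a class that holds an Easy HTTPClient (which is what I'm using), created at instantiation and reused throughout the program's execution.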

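Answer 3 (score: 1)

The problem appears to lie elsewhere, either entirely or in combination with this code.

There is no memory leak in the code you posted. You probably adapted the code before posting it on SO and left out important details. For example, you have left out the locking of the queue (which you mention is required, and I trust you do it); missing locking can lead to corruption and leaks. Another example is the input variable: is it really a global, or is it a data member? Mic mentions more potential errors, which may or may not be transcription errors.

We really need a complete, compilable example that demonstrates the problem.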