Question

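I'm currently writing a web crawler/spider in C++ on Linux and I'm having some problems with updating a database. I'm fairly new to C/C++, just FYI.

The database updates are performed by a separate thread (using pthreads), but the same problem occurs when they are executed in main(), so I may, perhaps naively, rule out the threading as a cause.

I'm using libmysqlcppconn as the database API.

I'm compiling with gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5.1) using -O2 -Wall -pedantic, and it compiles cleanly.

However, when the function commitChangesToDatabase() below is called, it basically picks items out of a std::map (url_queue), throws them into a std::vector (updates) and erases said items from the original std::map, then proceeds to iterate over the std::vector, executing a MySQL prepared statement for every item in the vector. This is where it fails.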

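At random, it either:

crashes without any error output (no segfault, no stack trace, nothing)
crashes with glibc detecting memory corruption (see the output here: http://pastie.org/private/wlkuorivq5tptlcr7ojg)
reports that the MySQL server has gone away (caught exception), but keeps on trying (does not crash)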

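I've tried switching the prepared statement to a simple executeUpdate(), to no avail. I've also tried removing the step that picks out items and instead performing the update as soon as I find an item to update, in the first loop through url_queue.

Other functions in this application also use prepared statements (another UPDATE) and work fine. Those functions are also run by separate threads.

I would run the application through valgrind, but frankly I don't understand most of its output, so it wouldn't help me much. If anyone wants its output, let me know which options to run it with and I'll provide it.

I don't know where to go from here. Does anyone have any idea what's wrong?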

struct queue_item_t {
    int id;
    int sites_id;
    int priority;
    int depth;
    int handler;
    int state;  // 0 = Pending, 1 = Working, 2 = Completed, 3 = Checked
    double time_allowed_crawl;

    bool status;
    bool was_redirected;

    double time;
    double time_end;
    double time_curl;
    double size;

    std::string hash;
    std::string url;
    std::string file;
    std::string host;
};

void commitChangesToDatabase()
{
    map< string, queue_item_t >::iterator it, end;
    sql::PreparedStatement *pstmt;
    int i = 0;

    if (!url_queue.size()) {
        return;
    }

    pthread_mutex_lock(&dbCommitMutex);
    pthread_mutex_lock(&itemMutex);

    cout << "commitChangesToDatabase()" << endl;
    pstmt = dbPrepareStatement("UPDATE crawler_queue SET process_hash = NULL, date_crawled = NOW(), url = ?, hash = ? WHERE id = ?");

    for (it = url_queue.begin(); it != url_queue.end();)
    {
        if (it->second.state == 2)
        {
            pstmt->setString(1, it->second.url);
            pstmt->setString(2, it->second.hash);
            pstmt->setInt(3, it->second.id);

            try {
                pstmt->executeUpdate();
                ++i;

            } catch (sql::SQLException &e) {
                cerr << "# ERR: SQLException in " << __FILE__;
                cerr << "(" << __FUNCTION__ << ") on line " << __LINE__ << endl;
                cerr << "# ERR: " << e.what();
                cerr << " (MySQL error code: " << e.getErrorCode();
                cerr << ", SQLState: " << e.getSQLState() << " )" << endl;
            }

            url_queue.erase(it++);
        }
        else {
            ++it;
        }
    }

    delete pstmt;

    cout << "~commitChangesToDatabase()" << endl;

    pthread_mutex_unlock(&itemMutex);
    pthread_mutex_unlock(&dbCommitMutex);
}

// this function is defined in another file but is written here just to show the contents of it
sql::PreparedStatement *dbPrepareStatement(const std::string &query)
{
    return con->prepareStatement(query);
}

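Edit:

Some people seem to think the problem lies in the iteration over the url_queue collection, but I have ruled that out by commenting out everything that touches the database while keeping the iteration. Furthermore, the iteration shown here is a simplified (but working) version of the original, which picks items out of the map, throws them into a vector and erases them from the map, as shown below. That part of the program works fine - it only crashes when the database is used.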

for (it = url_queue.begin(); it != url_queue.end();)
{
    if (it->second.state == 2)
    {
        update_item.type = (!it->second.was_redirected ? 1 : 2);
        update_item.item = it->second;

        updates.push_back(update_item);

        url_queue.erase(it++);
    }
    else {
        ++it;
    }
}

Edit 2:

Output of valgrind --leak-check=yes: http://pastie.org/private/2ypk0bmawwsqva3ikfazw

Answer 1

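It seems the iterator is incremented more often than necessary: once in the loop body and again in the for statement. In this code the end iterator can end up being incremented, which is a problematic operation and may be the source of the problem.

The following loop structure is better suited to this case:
it = url_queue.begin();
while (it != url_queue.end()) {
    // loop body
}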

Answer 2

I don't think messing around with the iterator is a good idea. Replace:

else {
        ++it;
    }

with:

else continue;

or just remove it.

glibc memory corruption with libmysqlcppconn prepared statements
