Question

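I have a problem.

So, my story is:

I have one large 30 GB file (JSON) that contains all the reddit posts from a specific time range. I am not inserting every value of each post into the table.

I followed this series, in which the author wrote the code in Python. I tried to follow along in NodeJS, but when I tested it, it was far too slow: it inserts one row every 5 seconds. There are more than 500,000 reddit posts in there, so it would literally take years.

Here is an example of what I am doing.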

var fs = require('fs');
var oboe = require('oboe');

var readStream = fs.createReadStream(location)
oboe(readStream)
    .done(async function(data) {
        let { parent_id, body, created_utc, score, subreddit } = data;
        let comment_id = data.name;

        // Checks if there is a comment with the comment id of this post's parent id in the table
        getParent(parent_id, function(parent_data) {
            // Checks if there is a comment with the same parent id, and then checks which one has higher score
            getExistingCommentScore(parent_id, function(existingScore) {

                // other code above but it isn't relevant for my question

                // this function adds the query I made to a table
                // (query is built by the omitted code above)
                addToTransaction(query)

            })
        })
    })

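Basically, this starts a read stream and passes it to a module called oboe.

I then get JSON back. For each post, it checks whether the parent is already saved in the database, and then checks whether there is an existing comment with the same parent ID.

I need both functions to get the data I want (keeping only the "best" comment).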
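To make the selection rule above concrete, here is a minimal sketch of it kept in memory instead of going through the database. The names `bestReplyByParent` and `considerComment` and the sample comments are made up for illustration; the real code does both checks with queries (`getParent` and `getExistingCommentScore`), so this only shows the intended "keep the highest-scoring reply per parent" logic.

```javascript
// Hypothetical in-memory index: parent_id -> best reply seen so far.
const bestReplyByParent = new Map();

// Returns true if `comment` becomes the best reply for its parent.
function considerComment(comment) {
    const current = bestReplyByParent.get(comment.parent_id);
    if (current !== undefined && current.score >= comment.score) {
        return false; // an equal or higher-scoring reply is already stored
    }
    bestReplyByParent.set(comment.parent_id, comment);
    return true;
}

// Made-up sample data using the field names from the post.
const sample = [
    { name: 't1_a', parent_id: 't3_x', body: 'first', score: 2 },
    { name: 't1_b', parent_id: 't3_x', body: 'second', score: 5 },
    { name: 't1_c', parent_id: 't3_x', body: 'third', score: 3 },
];
const accepted = sample.map(considerComment);
```

With this sample data, the second comment replaces the first and the third is rejected, so only `t1_b` remains stored for parent `t3_x`.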

This is what addToTransaction looks like:

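// Queries waiting to be written in the next transaction
var transactions = [];

function addToTransaction(query) {
    // adds the query to the pending list, then checks if the length of that list is 1000 or more
    transactions.push(query);

    if (transactions.length >= 1000) {
        // take the pending queries out of the list so they are not executed again on the next flush
        var batch = transactions.splice(0, transactions.length);

        connection.beginTransaction(function(err) {
            if (err) throw err;

            for (var n = 0; n < batch.length; n++) {
                connection.query(batch[n], function(err) {
                    if (err) throw err;
                });
            }

            // assuming the mysql driver, which queues commands per connection,
            // the commit runs after the queries above
            connection.commit(function(err) {
                if (err) throw err;
            });
        });
    }
}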

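What addToTransaction does is take the query I made and push it onto a list. It then checks the length of that list and, once it is long enough, creates a new transaction, executes all of those queries in a for loop, and then commits (saves).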
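The batching idea described above can be sketched separately from the database. This is only an illustration under assumptions: `createBatcher`, `add`, `end` and the `flush` callback are hypothetical names, and the flush here just records the batch where the real code would open a transaction and run the queries.

```javascript
// Hypothetical helper: collects items and hands them to `flush` in batches.
function createBatcher(batchSize, flush) {
    let pending = [];
    return {
        add(item) {
            pending.push(item);
            if (pending.length >= batchSize) {
                const batch = pending;
                pending = []; // start a fresh list so nothing is flushed twice
                flush(batch);
            }
        },
        // Flushes whatever is left, e.g. once the end of the file is reached.
        end() {
            if (pending.length > 0) {
                const batch = pending;
                pending = [];
                flush(batch);
            }
        },
    };
}

// Usage with a batch size of 3 and a flush that only records what it received.
const flushed = [];
const batcher = createBatcher(3, (batch) => flushed.push(batch));
for (let i = 1; i <= 7; i++) batcher.add('query ' + i);
batcher.end();
```

Seven queries with a batch size of 3 produce two full batches and one final batch of a single query. The explicit `end()` matters because a leftover batch smaller than the threshold would otherwise never be written.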

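The problem is that it is so slow that the callback function I wrote does not even get called.

(Finally) my question is: is there any way to improve the performance?

(If you are wondering why I am doing this, it is because I am trying to create a chatbot.)

I know I have posted a lot, but I tried to give you as much information as possible so that you have a better chance of helping me. I appreciate any answers, and I will answer any questions you have.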

Better performance when saving a large JSON file to MySQL

0 answers: