Question

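I want to insert 1,500,000 documents into MongoDB. First, I query a database and get a list of 15,000 instructors from it, and for each instructor I want to insert 100 courses.

I run two loops: the outer one iterates over all instructors, and the inner one inserts 100 documents for that instructor's id on each iteration, as in the following code: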

const instructors = await Instructor.find();
// instructors contains 15000 instructors
instructors.forEach((insructor) => {
    for(let i=0; i<=10; i++) {
        const course = new Course({
            title: faker.lorem.sentence(),
            description: faker.lorem.paragraph(),
            author: insructor._id,
            prise: Math.floor(Math.random()*11),
            isPublished: 'true',
            tags: ["java", "Nodejs", "javascript"]
        });
        course.save().then(result => {
            console.log(result._id);
            Instructor.findByIdAndUpdate(insructor._id, { $push: { courses: course._id } })
            .then(insructor => {
                console.log(`Instructor Id : ${insructor._id} add Course : ${i} `);
            }).catch(err => next(err));
            console.log(`Instructor id: ${ insructor._id } add Course: ${i}`)
        }).catch(err => console.log(err));
    }
});

This is my package.json file, where I put something I found on the internet:

{
    "scripts": {
        "start": "nodemon app.js",
        "fix-memory-limit": "cross-env LIMIT=2048 increase-memory-limit"
    },
    "devDependencies": {
        "cross-env": "^5.2.0",
        "faker": "^4.1.0",
        "increase-memory-limit": "^1.0.6"
    }
}

This is my Course model definition:

const mongoose = require('mongoose');

const Course = mongoose.model('courses', new mongoose.Schema({

title: {
    type: String,
    required: true,
    minlength: 3
},
author: {
    type: mongoose.Schema.Types.ObjectId,
    ref: 'instructor'
},
description: {
    type: String,
    required: true,
    minlength: 5
},
ratings: [{
    user: {
        type: mongoose.Schema.Types.ObjectId,
        ref: 'users',
        required: true,
        unique: true
    },
    rating: {
        type: Number,
        required: true,
        min: 0,
        max: 5
    },
    description: {
        type: String,
        required: true,
        minlength: 5
    }
}],
tags: [String],
rating: {
    type: Number,
    min: 0,
    default: 0
},
ratedBy: {
    type: Number,
    min: 0,
    default: 0
},
prise: {
    type: Number,
    required: function() { return this.isPublished; },
    min: 0
},
isPublished: {
    type: Boolean,
    default: false
}
}));

module.exports = Course;

Answer 1

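For large amounts of data, you have to use cursors.

The idea is to process each document as soon as you receive it from the database.

You ask the database for the instructors, the database sends them back in small batches, and you process each batch until you reach the end of all batches.

Otherwise, await Instructor.find() loads all the data into memory and wraps every document in a Mongoose model instance whose methods you don't need.

Even await Instructor.find().lean() does not solve the memory problem, because the full result set is still loaded into memory at once.

A cursor is a MongoDB feature: it is what you get when you perform find on a collection.

With Mongoose it is accessible through: Instructor.collection.find({})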
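
To make the difference concrete, here is a minimal self-contained sketch that does not touch MongoDB at all. The async generator fakeCursor is a hypothetical stand-in for a real cursor (the name and the simulated round trip are assumptions made for this example): it hands out documents one at a time while fetching them in batches, so only one batch is held in memory at any moment.

```javascript
// Hypothetical stand-in for a database cursor: yields documents one by one,
// but "fetches" them from the source in batches of `batchSize`.
async function* fakeCursor(total, batchSize) {
  for (let start = 0; start < total; start += batchSize) {
    const end = Math.min(start + batchSize, total);
    const batch = [];
    for (let id = start; id < end; id++) {
      batch.push({ _id: id });
    }
    // Simulate the asynchronous round trip that delivers this batch.
    await new Promise((resolve) => setImmediate(resolve));
    yield* batch; // only this batch is alive while it is being consumed
  }
}

// Consumes the cursor document by document and returns how many were seen.
async function countWithCursor(total, batchSize) {
  let processed = 0;
  for await (const doc of fakeCursor(total, batchSize)) {
    if (doc._id !== undefined) processed += 1; // real per-document work goes here
  }
  return processed;
}
```

With Mongoose itself, Instructor.find().cursor() should give a cursor that can be consumed with the same kind of for await loop, as far as I know.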

Watch this video.

Below, I have written a solution that processes the data in batches using a cursor.

Add this somewhere inside your module:

const createCourseForInstructor = (instructor) => {
  const data = {
    title: faker.lorem.sentence(),
    description: faker.lorem.paragraph(),
    author: instructor._id,
    prise: Math.floor(Math.random()*11), // typo: "prise", must be: "price"
    isPublished: 'true',
    tags: ["java", "Nodejs", "javascript"]
  };
  return Course.create(data);
}

const assignCourseToInstructor = (course, instructor) => {
  const where = {_id: instructor._id};
  const operation = {$push: {courses: course._id}};
  return Instructor.collection.updateOne(where, operation, {upsert: false});
}

const processInstructor = async (instructor) => {
  let courseIds = [];
  for(let i = 0; i < 100; i++) {
    try {
      const course = await createCourseForInstructor(instructor)
      await assignCourseToInstructor(course, instructor);
      courseIds.push(course._id);
    } 
    catch (error) {
      console.error(error.message);
    }
  }
  console.log(
    'Created ', courseIds.length, 'courses for', 
    'Instructor:', instructor._id, 
    'Course ids:', courseIds
  );
};

Then, inside your async block, replace your loops with:

const cursor = await Instructor.collection.find({}).batchSize(1000);

while(await cursor.hasNext()) {
  const instructor = await cursor.next();
  await processInstructor(instructor);
}

P.S. I use the native collection.find and collection.updateOne for performance, so that Mongoose does not spend extra heap on attaching its methods and fields to model instances.
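
A possible further refinement, which is not part of the code above: instead of two round trips per course, build all courses for one instructor in memory, write them with a single insertMany, and then link them with a single updateOne using $push with $each. The sketch below is only an illustration under stated assumptions. makeFakeCollection is a hypothetical in-memory stub that mimics just the two driver calls used, so the snippet runs without a database, and the course data is a placeholder instead of faker output. With real collections you would presumably pass Course.collection and Instructor.collection; keep in mind that going through the native collection skips Mongoose schema validation and defaults.

```javascript
// Hypothetical in-memory stub mimicking only insertMany and updateOne ($push/$each).
function makeFakeCollection(docs = []) {
  let nextId = 1;
  return {
    docs,
    calls: 0, // counts simulated round trips
    async insertMany(newDocs) {
      this.calls += 1;
      const insertedIds = {};
      newDocs.forEach((doc, index) => {
        if (doc._id === undefined) doc._id = `id-${nextId++}`;
        this.docs.push(doc);
        insertedIds[index] = doc._id;
      });
      return { insertedCount: newDocs.length, insertedIds };
    },
    async updateOne(filter, update) {
      this.calls += 1;
      const target = this.docs.find((doc) => doc._id === filter._id);
      if (!target) return { matchedCount: 0, modifiedCount: 0 };
      for (const [field, spec] of Object.entries(update.$push)) {
        target[field] = (target[field] || []).concat(spec.$each);
      }
      return { matchedCount: 1, modifiedCount: 1 };
    },
  };
}

// Two round trips per instructor instead of two per course.
async function addCoursesInBulk(courses, instructors, instructor, count) {
  const batch = Array.from({ length: count }, (_, i) => ({
    title: `Course ${i}`, // placeholder data instead of faker
    author: instructor._id,
  }));
  const { insertedIds } = await courses.insertMany(batch);
  const courseIds = Object.values(insertedIds);
  await instructors.updateOne(
    { _id: instructor._id },
    { $push: { courses: { $each: courseIds } } }
  );
  return courseIds;
}
```

Inside processInstructor, a call such as addCoursesInBulk(Course.collection, Instructor.collection, instructor, 100) would then replace the inner for loop.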

Bonus:

If your code runs out of memory again even with this cursor solution, run it as in the example below (set the size in megabytes according to the server's RAM):

nodemon --expose-gc --max_old_space_size=10240 app.js
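
To confirm that the flag took effect, Node's built-in v8 module reports the heap limit of the running process. A small sketch (the helper name heapLimitMb is made up for this example):

```javascript
const v8 = require('v8');

// Returns the V8 heap size limit of the current process, in megabytes.
function heapLimitMb() {
  return Math.round(v8.getHeapStatistics().heap_size_limit / (1024 * 1024));
}

console.log(`Heap limit: ${heapLimitMb()} MB`);
```

When the process is started with --max_old_space_size=10240, the reported value should be in the neighbourhood of 10240 MB; the exact number can differ slightly because the limit also covers other heap spaces.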

Answer 2

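The reason is that you are not awaiting the promises returned by save; instead, you immediately continue with the next iteration of the for and forEach loops. This means you launch a huge number of (pending) save operations at once, which does indeed inflate the memory usage of the mongodb library.

It is better to wait for each save (and the chained findByIdAndUpdate) to resolve before moving on to the next iteration.

Since you are apparently in an async function context, you can use await for this, provided you replace the forEach loop with a for loop (so that you stay in the same function context):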

async function yourFunction() {
    const instructors = await Instructor.find();
    for (let instructor of instructors) { // Use `for` loop to allow for more `await`
        for (let i=0; i<10; i++) { // You want 10 times, right?
            const course = new Course({
                title: faker.lorem.sentence(),
                description: faker.lorem.paragraph(),
                author: instructor._id,
                prise: Math.floor(Math.random()*11),
                isPublished: 'true',
                tags: ["java", "Nodejs", "javascript"]
            });
            const result = await course.save();
            console.log(result._id);
            instructor = await Instructor.findByIdAndUpdate(instructor._id, { $push: { courses: course._id } });
            console.log(`Instructor Id : ${instructor._id} add Course : ${i}`);
        }
    }
}

Now all the save operations are serialized: the next one starts only when the previous one has completed.
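
The effect is easy to reproduce without a database. In the sketch below, fakeSave is a hypothetical stand-in for course.save() that only tracks how many operations are pending at the same time; runUnawaited mirrors the forEach version from the question and runAwaited mirrors the awaited for loop above. All names are invented for this demo.

```javascript
// Tracks how many simulated saves are pending simultaneously.
function makeTracker() {
  const tracker = { inFlight: 0, peak: 0 };
  tracker.fakeSave = async () => {
    tracker.inFlight += 1;
    tracker.peak = Math.max(tracker.peak, tracker.inFlight);
    await new Promise((resolve) => setTimeout(resolve, 1)); // pretend I/O
    tracker.inFlight -= 1;
  };
  return tracker;
}

// Fire-and-forget: every save is started before any of them has finished.
async function runUnawaited(items) {
  const tracker = makeTracker();
  const pending = [];
  items.forEach(() => {
    pending.push(tracker.fakeSave());
  });
  await Promise.all(pending); // only so the demo can report the peak
  return tracker.peak;
}

// Serialized: the next save starts only after the previous one has resolved.
async function runAwaited(items) {
  const tracker = makeTracker();
  for (let i = 0; i < items.length; i++) {
    await tracker.fakeSave();
  }
  return tracker.peak;
}
```

For a list of 50 items, runUnawaited resolves to a peak of 50 pending operations, while runAwaited never has more than 1 in flight.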

Note that I have not included your error handling: that is best done with a catch call chained to the call of this async function.
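
A minimal sketch of that pattern, using a hypothetical seedCourses function as a stand-in for the async function above and making it fail on purpose:

```javascript
// Hypothetical stand-in for the seeding function; rejects to show the error path.
async function seedCourses(shouldFail) {
  if (shouldFail) {
    throw new Error('insert failed');
  }
  return 'done';
}

// One catch at the call site handles a rejection from any awaited step inside.
function runSeed(shouldFail) {
  return seedCourses(shouldFail)
    .then((result) => `ok: ${result}`)
    .catch((err) => `error: ${err.message}`);
}
```

Because every step inside the async function is awaited, a failure in any of them rejects the promise returned by the function, so a single catch at the call site is enough.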

JavaScript heap out of memory - error while inserting into MongoDB
