Question

I have a MEAN application that works as expected: Angular pulls data from my MongoDB, Express handles the API, and so on.

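I want data from an RSS feed to be imported into my database as it arrives in the feed. Initially my application pulled JSON from the RSS feed on page load, but every time I refreshed the page I got duplicate data added from the feed. Is the best approach to keep pulling the feed on page refresh and check whether the _id already exists in the database? Or is there a better way to incorporate consumption of the RSS data into my database?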

Here is my app structure:

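/app                 backend stuff
- /controllers       all mongoose controllers for CRUD
- /models            mongoose schemas/models
- route.js           Express routes
/config              database config files
/node_modules        node modules
/public              all frontend AngularJS stuff
- /css
- /js
  - /controllers     Angular controllers
  - /services        Angular services/factories
  - app.js           ties the Angular components together
  - appRoutes.js     frontend routing
- /libs              Angular-generated libraries
- /views             html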

Would something like this go into my /app/controllers/reviews.js?

var mongoose = require('mongoose');
var Review = require('../models/review');

// equivalent to "Create" in CRUD
exports.getAllFromFeed = function(req, res) {
    // pull RSS feed
    // create Review object from JSON
    // check for duplicate in database
    // add to mongodb
}

And would I then call it on page load?

Answer 1

I had to look through your other questions to work out what you are asking here. Your general case seems to come down to two things:

How best to perform a task on a periodic basis.
How to avoid adding duplicate data from the feed when updating.

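So basically the best approach here is to manage the feed data loaded into the collection via MongoDB "upserts", which only create a new document where one does not already exist. To do this you need to manipulate what you receive from the feed slightly, mainly so that a unique identifier from the feed is used as the default _id.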

Here is the basic process in node, with a few helper modules:

var async = require('async'),
    time = require('time'),
    CronJob = require('cron').CronJob,
    mongoose = require('mongoose'),
    Schema = mongoose.Schema,
    FeedParser = require('feedparser'),
    request = require('request');

mongoose.connect('mongodb://localhost/test');

var feedSchema = new Schema({
  _id: String
},{ strict: false });

var Feed = mongoose.model('Feed',feedSchema);

var job = new CronJob({
  cronTime: '0 0-59 * * * *',

  onTick: function() {

    var req = request('https://itunes.apple.com/us/rss/customerreviews/id=662900426/sortBy=mostRecent/xml'),
        feedparser = new FeedParser();

    var bulk = Feed.collection.initializeUnorderedBulkOp();

    req.on('error',function(err) {
      throw err;
    });

    req.on('response',function(res) {
      var stream = this;

      if (res.statusCode != 200) {
        return this.emit('error', new Error('Bad status code'));
      } else {
        console.log("res OK");
      }

      stream.pipe(feedparser);

    });

    feedparser.on('error',function(err) {
      throw err;
    });

    feedparser.on('readable',function() {

      var stream = this,
          meta = this.meta,
          item;

      while ( item = stream.read() ) {
        item._id = item.guid;
        delete item.guid;
        bulk.find({ _id: item._id }).upsert().updateOne({ "$set": item });
      }

    });

    feedparser.on('end',function() {
      console.log('at end');
      bulk.execute(function(err,response) {
        // Shouldn't be one as errors should be in the response
        // but just in case there was a problem connecting the op
        if (err) throw err;

        // Just dumping the response for demo purposes
        console.log( JSON.stringify( response, undefined, 4 ) );

      });
    });

  },
  start: false  // started below, once the database connection is open
});

mongoose.connection.on('open',function(err,db) {
  job.start();
});

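A few things to mention first. The Schema definition here uses strict: false, mainly because I don't want to specify every field and would rather let mongoose handle that for me. The _id is defined as "String", however, so that the type is correct for the "id" you will be using from the feed data.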

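The whole thing is wrapped in a "cron" job, using that node module. This sets up a regular "job" that runs on the specified schedule. The timing I use here is every minute, just for demonstration.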
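The pattern '0 0-59 * * * *' in the listing has six space-separated fields, and in the cron module the first one is seconds, so the job fires at second 0 of every minute. As a rough illustration, the fields can be named like this; describeCronTime is a hypothetical helper written for this answer, not something the cron module provides.

```javascript
// Field order of a six-field pattern as used by the cron module:
// seconds come first, then the usual five crontab fields.
const FIELD_NAMES = ['second', 'minute', 'hour', 'dayOfMonth', 'month', 'dayOfWeek'];

// Hypothetical helper (not part of the cron module): label each field.
function describeCronTime(cronTime) {
  const parts = cronTime.trim().split(/\s+/);
  if (parts.length !== FIELD_NAMES.length) {
    throw new Error('expected ' + FIELD_NAMES.length + ' fields, got ' + parts.length);
  }
  const fields = {};
  FIELD_NAMES.forEach(function (name, i) {
    fields[name] = parts[i];
  });
  return fields;
}

// The schedule from the listing: second 0 of every minute (0-59).
console.log(describeCronTime('0 0-59 * * * *'));
```

For a real feed you would likely pick something less frequent, for example '0 0 * * * *' to run once at the top of every hour.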

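The other parts implement the "feedparser" module functions, where a request is made for the content and the response is piped through feedparser to produce data you can work with. You could set that part up externally, but here it is only defined inside the "job" as an example.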

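To get the data into MongoDB I am using the Bulk Operations API here. You don't have to do it this way, but it does give a clearer picture of what is happening through the write response you get later. Otherwise the general mongoose methods with "upsert" specified will do, such as .findByIdAndUpdate().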
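As a minimal sketch of that alternative, assuming Model is a mongoose model such as the Feed model from the listing: upsertFeedItem is a hypothetical helper name, and this variant leaves _id out of the $set payload because findByIdAndUpdate already matches on it.

```javascript
// Hypothetical helper: upsert one feed item through a mongoose model.
// Model is assumed to expose findByIdAndUpdate(id, update, options),
// as mongoose models do.
function upsertFeedItem(Model, item) {
  // Copy every field except _id into the $set payload.
  const fields = {};
  Object.keys(item).forEach(function (key) {
    if (key !== '_id') {
      fields[key] = item[key];
    }
  });
  // upsert: true creates the document when no document has this _id.
  return Model.findByIdAndUpdate(item._id, { $set: fields }, { upsert: true });
}

// Usage inside the 'readable' handler, after item._id has been set:
//   upsertFeedItem(Feed, item);
```

This issues one request per item, so you lose the single summary response that the bulk version prints.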

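This happens in the event fired when the parser stream is readable. Each .read() request returns the current "item" from the feed. To keep everything happy, we change the "guid" field in the original data into the _id field. Then you just set up the "upsert" request. In bulk operations this only "queues" the request.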
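The guid-to-_id step can also be written as a small pure function that does not mutate the parsed item. This is only a sketch: toUpsert is a hypothetical name, and the title value in the sample item is made up for illustration.

```javascript
// Hypothetical helper: build the filter and update documents for one
// feed item, using its guid as the MongoDB _id.
function toUpsert(item) {
  if (!item.guid) {
    throw new Error('feed item has no guid to use as _id');
  }
  // Work on a copy so the parsed item itself is left untouched.
  const doc = Object.assign({}, item, { _id: item.guid });
  delete doc.guid;
  return { filter: { _id: doc._id }, update: { $set: doc } };
}

// Sample item; the title is illustrative, not real feed data.
const op = toUpsert({ guid: '1024220540', title: 'Example review' });
console.log(JSON.stringify(op, undefined, 4));
```

Inside the while loop, the result would be queued with bulk.find(op.filter).upsert().updateOne(op.update), which is the same call the listing makes.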

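Finally, at the end of the stream, the bulk operation is executed and so sent to the server. Here we inspect the response to see what actually happened.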

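Outside of the "job" definition, the job is simply "started" only once the connection to the database is available. That is generally good practice, but if you use the mongoose model methods for the "upserts", mongoose should "queue" the operations until the connection is made.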

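What happens now is that the job should kick in on startup, because of how it is defined, and then run again every minute, requesting the feed data and "upserting" it. On the first run against an empty collection, the actual output of the write response will look something like this: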

{
    "ok": 1,
    "writeErrors": [],
    "writeConcernErrors": [],
    "nInserted": 0,
    "nUpserted": 51,
    "nMatched": 0,
    "nModified": 0,
    "nRemoved": 0,
    "upserted": [
        {
            "index": 0,
            "_id": "https://itunes.apple.com/us/app/cox-contour-for-ipad/id662900426?mt=8&uo=2"
        },
        {
            "index": 1,
            "_id": "1024220540"
        },
        {
            "index": 2,
            "_id": "1023922797"
        },
        {
            "index": 3,
            "_id": "1023784213"
        },
        {
            "index": 4,
            "_id": "1023592061"
        }
    ]
}

And so on, for however many items were returned in the feed, since these items are newly inserted into the collection. But when the next "tick" runs:

{
    "ok": 1,
    "writeErrors": [],
    "writeConcernErrors": [],
    "nInserted": 0,
    "nUpserted": 0,
    "nMatched": 51,
    "nModified": 0,
    "nRemoved": 0,
    "upserted": []
}

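Since nothing is new and nothing has actually changed, it just reports that the items were "matched" and does nothing else to "modify" or "insert". MongoDB is generally smart enough to know this as long as you use the $set operator.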

If something does change in the data from the feed, the document will be "modified" where the data differs, or "upserted" where a new item exists in the feed.
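If you would rather log one line per run than dump the whole response, the counters can be condensed as in the sketch below. describeBulkResult is a hypothetical helper, and it assumes only the nUpserted, nMatched and nModified fields shown in the responses above.

```javascript
// Hypothetical helper: condense a bulk write response into one line.
// Reads only the counter fields shown in the sample responses.
function describeBulkResult(result) {
  if (result.nUpserted > 0 || result.nModified > 0) {
    // Matched documents that were not modified already held the same data.
    const unchanged = result.nMatched - result.nModified;
    return result.nUpserted + ' new, ' + result.nModified + ' changed, ' +
      unchanged + ' unchanged';
  }
  return 'no changes (' + result.nMatched + ' items already stored)';
}

// Counters taken from the first and second runs shown above.
console.log(describeBulkResult({ nUpserted: 51, nMatched: 0, nModified: 0 }));
console.log(describeBulkResult({ nUpserted: 0, nMatched: 51, nModified: 0 }));
```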

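Adapt this as you need to, but it is one way to set things up to run periodically while also avoiding a check for whether each item exists in the database before deciding whether to insert it.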
