Question

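I have two collections, forms4InsTrader_final (2 million documents) and TradeData (13 million documents). I am really struggling to understand why $out is not saving the aggregation result.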

The aggregation has the following stages:

Stage 1: $match dates within a specific date range. {'pdOfRpt': {'$gte': '2004-01-01', '$lte': '2020-12-31' }}
Stage 2: $lookup to join (forms4InsTrader_final) to TradeData {'from': 'aprl_test_Trade', 'localField': 'issuertradingsymbol', 'foreignField': 'ticker','as': 'string'}
Stage 3: $unwind the 'string' array from above
Stage 4: then match the dates within the same document {'$expr': {'$eq': [ '$pdOfRpt', '$string.Date_unmodified']}}
Stage 5: $unwind
Stage 6: use $project to select the few fields I need for analysis
Stage 7: use $out to save the result

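In all of the steps above, except Stage 7, everything runs as fast as expected on these two collections. However, I want to save this result as a separate collection. The query has been running for more than three hours with the result limited to about 1 million documents, yet I don't see the result saved in the separate collection. Interestingly, when I run this query with a $limit of 20000 documents, it gets saved in less than a minute. I don't understand why saving about 1 million documents with $out takes so long. What am I missing here?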

Please note that I tried this locally, using the visual query builder in Compass and/or the terminal.

The full pipeline:

`db.forms4InsTrader_final.aggregate([
    {
        '$match': {
            'pdOfRpt': {
                '$gte': '2004-01-01',
                '$lte': '2020-12-31'
            }
        }
    }, {
        '$lookup': {
            'from': 'TradeData',
            'localField': 'issuertradingsymbol',
            'foreignField': 'ticker',
            'as': 'string'
        }
    }, {
        '$unwind': {
            'path': '$string',
            'includeArrayIndex': 'Date_unmodified'
        }
    }, {
        '$match': {
            '$expr': {
                '$eq': [
                    '$pdOfRpt', '$string.Date_unmodified'
                ]
            }
        }
    }, {
        '$project': {
            'string.Adj Close': 1,
            'string.Volume': 1,
            'string.Close': 1,
            'string.avg_Week_Vol': 1,
            'string.avg_Week_Adj_Close_Price': 1,
            'string.Date_unmodified': 1,
            'pdOfRpt': 1,
            'issuercik': 1,
            'issuertradingsymbol': 1,
            'reportingownerid_rptownercik': 1,
            'reportingowneraddress_rptownerzipcode': 1,
            'reportingownerrelationship_isdirector': 1,
            'reportingownerrelationship_isofficer': 1,
            'reportingownerrelationship_istenpercentowner': 1,
            'reportingownerrelationship_isother': 1,
            'nonderivativetransaction_securitytitle_value': 1,
            'nonderivativetransaction_transactionamounts_transactionshares_value': 1,
            'nonderivativetransaction_transactionamounts_transactionpricepershare_value': 1,
            'nonderivativetransaction_transactionamounts_transactionacquireddisposedcode_value': 1,
            'nonderivativetransaction_posttransactionamounts_sharesownedfollowingtransaction_value': 1,
            'derivativetransaction_securitytitle_value': 1,
            'derivativetransaction_transactionamounts_transactionshares_value': 1,
            'derivativetransaction_transactionamounts_transactionpricepershare_value': 1,
            'derivativetransaction_transactionamounts_transactionacquireddisposedcode_value': 1,
            'derivativetransaction_ownershipnature_directorindirectownership_value': 1,
            'derivativetransaction_underlyingsecurity_underlyingsecuritytitle_value': 1,
            'derivativetransaction_underlyingsecurity_underlyingsecurityshares_value': 1,
            'derivativetransaction_posttransactionamounts_sharesownedfollowingtransaction_value': 1
        }
    }, {
        '$limit': 1000000
    }, {
        '$out': 'TestAPril20'
    }
])`

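The hacky way: Compass generates a temporary collection, which you can export and then re-import as a separate collection. Very inefficient, but hey, until I find another solution, I've been doing it by hand.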

Answer 1

I vaguely remember this being the solution when I ran into the same thing. Please let me know whether allowDiskUse solves your problem.

From the MongoDB docs:

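> Pipeline stages have a limit of 100 megabytes of RAM. If a stage exceeds this limit, MongoDB will produce an error. To allow for the handling of large datasets, use the allowDiskUse option to enable aggregation pipeline stages to write data to temporary files.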

It is an option passed alongside the pipeline, so here is an example of how to use it:

db.stocks.aggregate( [
      { $project : { cusip: 1, date: 1, price: 1, _id: 0 } },
      { $sort : { cusip : 1, date: 1 } }
   ],
   { allowDiskUse: true }
)
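
Applied to the pipeline from the question, that might look like the sketch below. The $project list is shortened here purely for readability, and the `typeof db` guard is only there so the snippet also loads outside the mongo shell; neither is part of the original query. Whether this actually fixes the slow $out is something you would need to confirm on your data.

```javascript
// Sketch: the question's pipeline (with an abbreviated $project) run with allowDiskUse.
const pipeline = [
  { $match: { pdOfRpt: { $gte: "2004-01-01", $lte: "2020-12-31" } } },
  {
    $lookup: {
      from: "TradeData",
      localField: "issuertradingsymbol",
      foreignField: "ticker",
      as: "string"
    }
  },
  { $unwind: { path: "$string", includeArrayIndex: "Date_unmodified" } },
  { $match: { $expr: { $eq: ["$pdOfRpt", "$string.Date_unmodified"] } } },
  // Abbreviated for illustration; use the full field list from the question.
  { $project: { "string.Close": 1, "string.Volume": 1, pdOfRpt: 1, issuertradingsymbol: 1 } },
  { $limit: 1000000 },
  { $out: "TestAPril20" }
];

// Options go in the second argument to aggregate(), not inside the pipeline array.
const options = { allowDiskUse: true };

// `db` only exists inside the mongo shell; skip the call anywhere else.
if (typeof db !== "undefined") {
  db.forms4InsTrader_final.aggregate(pipeline, options);
}
```

Note that $out stays the last stage of the pipeline; allowDiskUse only changes how memory-heavy stages before it are allowed to spill to disk.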
