Question

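I have some large documents (about 200 to 500 kb each).

Each document contains an array of sub-documents. In each sub-document there is an array I need to search through.

I need to build an interface that lets me fetch individual sub-documents.

What would be the smartest and fastest way to accomplish this, considering that I can't restructure the document model and that I don't know which parent document the target resides in?

I'd like to give examples of what I've tried, but I'm still struggling with the basic concept of "how do I search each sub-document's array for what I need", so please forgive the lack of any.

The parent document looks like this: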

{
"name":"Foobar",
"subs":[
    {           
        "imageName":"name",
        "foreignNames":[
            {
                // This is the field I need to search through
            }
        ]
    }
]
}

Answer 1

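Since there is a clear difference between returning the whole document and returning only the selected "sub-document" detail, and since you have nested arrays, the best approach is to use the aggregate() method.

So if you consider the following document as a sample: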

{
    "_id" : ObjectId("5380709ab5caa8c27c8a1392"),
    "name" : "Foobar",
    "subs" : [
        {
            "imageName" : "name",
            "foreignNames" : [
                {
                    "tagname" : "value"
                },
                {
                    "tagname" : "notvalue"
                }
            ]
        }
    ]
}

Then the aggregation statement is:

db.collection.aggregate([
    // Actually match the documents containing the matched value
    { "$match": {
        "subs.foreignNames.tagname": "value"
    }},

    // Unwind both of your arrays
    { "$unwind": "$subs" },
    { "$unwind": "$subs.foreignNames" },

    // Now filter only the matching array element
    { "$match": {
        "subs.foreignNames.tagname": "value"
    }},

    // Group back one level of data         
    { "$group": {
        "_id": {
            "_id": "$_id",
            "name": "$name",
            "imageName": "$subs.imageName"
        },
        "foreignNames": { "$push": "$subs.foreignNames" }
    }},

    // Group back to the original level
    { "$group": {
        "_id": "$_id._id",
        "name": { "$first": "$_id.name" },
        "subs": {
            "$push": { 
                "imageName": "$_id.imageName",
                "foreignNames": "$foreignNames"
            }
        }
    }}
])

The result would be:

{
    "_id" : ObjectId("5380709ab5caa8c27c8a1392"),
    "name" : "Foobar",
    "subs" : [
        {
            "imageName" : "name",
            "foreignNames" : [
                {
                    "tagname" : "value"
                }
            ]
        }
    ]
}

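The good thing about this is that if you have multiple matches, including matches in other items of "subs", it keeps them all together while filtering out the entries that do not match.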
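To make the effect of the two $unwind stages and the second $match more concrete, here is a minimal in-memory sketch in plain JavaScript. It does not talk to MongoDB at all; the helpers getPath, setPath and unwind are hypothetical names written only for this illustration, and the emulation is simplified (a missing or non-array field just drops the document).

```javascript
// In-memory emulation of $unwind and $match on the sample document.
// This is NOT the MongoDB API; all helpers here are made up for illustration.
const sample = {
  _id: "5380709ab5caa8c27c8a1392",
  name: "Foobar",
  subs: [
    {
      imageName: "name",
      foreignNames: [{ tagname: "value" }, { tagname: "notvalue" }],
    },
  ],
};

// Read a dotted path such as "subs.foreignNames" from an object.
function getPath(obj, path) {
  return path.split(".").reduce((o, key) => (o == null ? undefined : o[key]), obj);
}

// Return a copy of obj with the value at the dotted path replaced.
function setPath(obj, path, value) {
  const [head, ...rest] = path.split(".");
  const copy = { ...obj };
  copy[head] = rest.length === 0 ? value : setPath(obj[head], rest.join("."), value);
  return copy;
}

// Emit one copy of each document per element of the array at `path`.
// Simplification: documents where the path is not an array are dropped.
function unwind(docs, path) {
  return docs.flatMap((doc) => {
    const arr = getPath(doc, path);
    return Array.isArray(arr) ? arr.map((el) => setPath(doc, path, el)) : [];
  });
}

const afterSubs = unwind([sample], "subs");
const afterNames = unwind(afterSubs, "subs.foreignNames");

// Equivalent of the second $match: keep only the matching array element.
const matched = afterNames.filter((d) => d.subs.foreignNames.tagname === "value");

console.log(afterNames.length); // 2 rows, one per foreignNames entry
console.log(JSON.stringify(matched.map((d) => d.subs.foreignNames)));
```

After both unwinds, each row carries exactly one foreignNames entry, which is why a plain equality match can then discard the non-matching ones before the $group stages rebuild the arrays.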

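If you do not need that and only want the specific "sub-document", or just a specific field from the whole document, then you can stop short of the $group stages and simply $project the result you want: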

db.collection.aggregate([
    { "$match": {
        "subs.foreignNames.tagname": "value"
    }},
    { "$unwind": "$subs" },
    { "$unwind": "$subs.foreignNames" },
    { "$match": {
        "subs.foreignNames.tagname": "value"
    }},
    { "$project": {
        "_id": 0,
        "matched": "$subs.foreignNames"
    }}
])

With the sample document, that would return:

{ "matched" : { "tagname" : "value" } }

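So that is how you approach this.

Note: in case you are wondering why the $match statement appears in the pipeline twice, the code comments explain it, but here are the main points.

Even if only 1 document in a collection of 10,000 actually has an inner array document matching your condition, it makes sense to run this $match before doing any of the array unwinding.

Even though you will filter down to 1 result later, you do not want to $unwind all 10,000 documents, whose arrays may hold 100,000 or more entries, and then search through that to find 1. You want to reduce the input to the smallest possible set and discard any document that can never contain the sub-document you want.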
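The size difference described above can be illustrated with a small in-memory sketch in plain JavaScript. The collection is synthetic data invented for this example, and hasTag and unwoundRows are hypothetical helpers, not MongoDB functions; the point is only to count how many rows the two $unwind stages would produce with and without the leading $match.

```javascript
// In-memory illustration only; no MongoDB involved.
// Build 1,000 synthetic parent documents; only _id 42 holds the wanted tag.
const docs = [];
for (let i = 0; i < 1000; i++) {
  docs.push({
    _id: i,
    subs: [
      { imageName: "a", foreignNames: [{ tagname: "x" + i }, { tagname: "y" + i }] },
      { imageName: "b", foreignNames: [{ tagname: i === 42 ? "value" : "z" + i }] },
    ],
  });
}

// Mimics the first $match: does any nested foreignNames entry carry the tag?
const hasTag = (doc, tag) =>
  doc.subs.some((s) => s.foreignNames.some((f) => f.tagname === tag));

// Number of rows produced by unwinding subs and then subs.foreignNames.
const unwoundRows = (ds) =>
  ds.reduce((n, d) => n + d.subs.reduce((m, s) => m + s.foreignNames.length, 0), 0);

const rowsWithoutPreMatch = unwoundRows(docs);
const rowsWithPreMatch = unwoundRows(docs.filter((d) => hasTag(d, "value")));

console.log(rowsWithoutPreMatch); // 3000
console.log(rowsWithPreMatch); // 3
```

In this toy collection the leading filter cuts the unwound rows from 3,000 to 3; the second match still has to run afterwards, because the surviving parent document also contains non-matching entries.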

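Also, using $match as the initial stage of the aggregation pipeline is the only chance you have to select an index to improve query performance. Once you start de-structuring or re-structuring the documents, the index is no longer available.

So create the index first, i.e.: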

// createIndex() replaces the deprecated ensureIndex() helper
db.collection.createIndex({ "subs.foreignNames.tagname": 1 })

Answer 2

This finds all documents that have, in the array foreignNames, a field tagname (replace it with the field you want to check) with the value value.

db.collection.find({"subs.foreignNames.tagname":"value"})
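Note that a find() with this filter returns the whole matching parent documents, not only the matching sub-documents. If you go this route, one option is to narrow the result on the client. The following is a minimal sketch in plain JavaScript under that assumption; extractSubs is a hypothetical helper, and the found array is a stand-in for whatever your driver returns.

```javascript
// Client-side narrowing of parent documents already returned by find().
// `extractSubs` is a hypothetical helper, not part of any MongoDB driver.
function extractSubs(parents, tag) {
  return parents.flatMap((parent) =>
    (parent.subs || []).filter((sub) =>
      (sub.foreignNames || []).some((f) => f.tagname === tag)
    )
  );
}

// Stand-in for the documents a find() call might return.
const found = [
  {
    name: "Foobar",
    subs: [
      { imageName: "name", foreignNames: [{ tagname: "value" }, { tagname: "notvalue" }] },
      { imageName: "other", foreignNames: [{ tagname: "notvalue" }] },
    ],
  },
];

const subsOnly = extractSubs(found, "value");
console.log(subsOnly.map((s) => s.imageName)); // [ 'name' ]
```

The trade-off is that each full parent document (200 to 500 kb in the question) still travels over the wire before being trimmed, which the aggregation approach avoids.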

You can add an index for this search with the following command. For more information on indexes (and their limitations), see the documentation.

db.collection.createIndex({"subs.foreignNames.tagname":1})

MongoDB - return subdocument of large parent