Question

到目前为止，我的查询看起来像这样：

 db.variants.aggregate({'$unwind':'$samples'},{$project:{"_id":0,"samples":1,"chr":1,"pos":1,"ref":1,"alt":1}});
    { "chr" : "22", "pos" : 14373, "ref" : "C", "alt" : "T", "samples" : { "GT" : "0|0", "GQ" : 48, "DP" : 1, "HQ" : [ 34, 1 ], "GTC" : 0, "sample_id" : "559de1b2aa43f47656b2a3fa"}
    { "chr" : "22", "pos" : 14373, "ref" : "C", "alt" : "T", "samples" : { "GT" : "1|0", "GQ" : 15, "DP" : 8, "HQ" : [ 5, 51 ], "GTC" : 1, "sample_id" : "559de1b2aa43f47656b2a3f9"}
    { "chr" : "22", "pos" : 14373, "ref" : "C", "alt" : "T", "samples" : { "GT" : "1/1", "GQ" : 43, "DP" : 5, "HQ" : [ 0, 2 ], "GTC" : 2, "sample_id" : "559de1b2aa43f"}
    { "chr" : "20", "pos" : 14371, "ref" : "A", "alt" : "G", "samples" : { "GT" : "0|0", "GQ" : 48, "DP" : 1, "HQ" : [ 51, 51 ], "GTC" : 0, "sample_id" : "559de1b2aa43f47656b2a3fa"}
    { "chr" : "20", "pos" : 14371, "ref" : "A", "alt" : "G", "samples" : { "GT" : "1|0", "GQ" : 48, "DP" : 8, "HQ" : [ 51, 51 ], "GTC" : 1, "sample_id" : "559de1b2aa43f47656b2a3f9"}

但在查询之外，我已经定义了这样的样本组：

 SID1=['559de1b2aa43f47656b2a3fa','559de1b2aa43f47656b2a3f9']
 SID2=['559de1b2aa43f']

但是当涉及到如何进行实际分组时，我陷入困境，因为我的sample_id数组实际上并不在文档中，因此普通的$ group运算符将无效。我想创建一个名为SID1和SID2的组，如果samples.study_id在SID1数组中（并且SID2相同），则对“samples.GTC”求和。我的预期输出是：

 { "chr" : "22", "pos" : 14373, "ref" : "C", "alt" : "T", SID1:1, SID2:2}
 { "chr" : "22", "pos" : 14371, "ref" : "A", "alt" : "G", SID1:1, SID2:0}

我猜它应该有点接近这个，但显然不是很明显：

db.variants.aggregate(
 {'$unwind':'$samples'},
 {$project:
   {"_id":0,"chr":1,"pos":1,"ref":1,"alt":1,"samples":1}
 },
 { "$group":{
    "_id":{"chr":"$chr","pos":"$pos","ref":"$ref","alt":"$alt"},
    "SID1" : {$cond:{if:{"$sample_id":{$in:SID1}},then:{$sum:"$GTC"},else:0}},
    "SID2" : {$cond:{if:{"$sample_id":{$in:SID2}},then:{$sum:"$GTC"},else:0}},
    }
 });

Answer 1

对于$cond来说，你是正确的，但语法有点错误，你也需要一些其他助手：

var SID1 = ['559de1b2aa43f47656b2a3fa','559de1b2aa43f47656b2a3f9'],
    SID2 = ['559de1b2aa43f'];

db.variants.aggregate([
    { "$unwind": "$samples" },
    { "$group": {
        "_id": {
           "chr": "$chr",
           "pos": "$pos",
           "ref": "$ref",
           "alt": "$alt"
        },
        "SID1": {
            "$sum": {
                "$cond": [
                    { "$setIsSubset": [
                        { "$map": { 
                            "input": { "$literal": ["A"] },
                            "as": "el",
                            "in": "$samples.sample_id"
                        }},
                        SID1
                    ]},
                    "$samples.GTC",
                    0
                ]
            }
        },
        "SID2": {
            "$sum": {
                "$cond": [
                    { "$setIsSubset": [
                        { "$map": { 
                            "input": { "$literal": ["A"] },
                            "as": "el",
                            "in": "$samples.sample_id"
                        }},
                        SID2
                    ]},
                    "$samples.GTC",
                    0
                ]
            }
        }
    }}
])

这就是结果：

{
    "_id" : {
            "chr" : "20",
            "pos" : 14371,
            "ref" : "A",
            "alt" : "G"
    },
    "SID1" : 1,
    "SID2" : 0
}
{
    "_id" : {
            "chr" : "22",
            "pos" : 14373,
            "ref" : "C",
            "alt" : "T"
    },
    "SID1" : 1,
    "SID2" : 2
}

所以$cond进入＆＃34;内部＆＃34; $sum因为那是一个＆＃34;累加器＆＃34;以及你如何在$group下构建。

在定义管道时直接使用变量名称没有任何问题，因为该值只是＆＃34;内插＆＃34;并被视为文字。但当然，因为这些是＆＃34;数组＆＃34;要比较它们。更重要的是，它们实际上是＆＃34;设置＆＃34;。

$setIsSubset运算符可以＆＃34;逻辑＆＃34;比较两个＆＃34;＆＃34;以便查看是否包含另一个的元素。这为true/false提供了合理的$cond。

然而＆＃34; samples.sample_id＆＃34;字段不是数组。但我们可以简单地将它变成一个＆＃34;通过使用$map运算符为其提供一个声明为单个元素的$literal数组并转置该值。

$map运算符与许多编程语言中的function of the same name完全相同，它在数组中作用于＆＃34;输入＆＃34;。它将每个数组元素作为声明变量从＆＃34;处理为＆＃34;通过处理＆＃34; in＆＃34;中的函数表达式。它返回一个与输入长度相同的数组，但结果由函数表达式应用。另一个例子：

{ "$map": {
    "input": { "$literal": [1,2,3,4] },   // input array
    "as": "el",                           // variable represents element
    "in": {                               
        "$multiply": [ "$$el", "$$el" ]   // square of element
    }
}

返回：

[1,4,9,16]                                // All array elements "squared"

自MongoDB 2.2以来，$literal运算符实际上已经出现了聚合框架，但是未记录的运算符$const。虽然前面提到过，注射＆＃34;注射＆＃34;如图所示，外部变量进入聚合管道，你不能做的一件事是＆＃34;返回＆＃34;该值作为文档的属性。作为表达式参数，在大多数情况下这很好，但是例如你不能这样做：

{ "$project": {
    "myfield": ["bill","ted","fred"]
}}

哪会导致错误，所以请执行以下操作：

{ "$project": {
    "myfield": { "$literal": ["bill","ted","fred"] }
}}

允许将字段设置为您想要的字段，即值数组。

因此，与列表中的$map组合，它只是一种表示管道中不存在的单个元素的数组的方式，以便转换＆＃34;转换＆＃34;它是当前字段的值。

它变成了这个：

"559de1b2aa43f47656b2a3fa"

通过代码：

{ "$map": {
    "input": { "$literal": ["A"] },
    "as": "el",
    "in": "$samples.sample_id"           // into this ["559de1b2aa43f47656b2a3fa"]
}}

这使得$setIsSubset操作在内部看起来像这样：

{ "$setIsSubset": [ 
    ["559de1b2aa43f47656b2a3fa"],
    ["559de1b2aa43f47656b2a3fa","559de1b2aa43f47656b2a3f9"]
}}                                                          // true

最终结果是比较每个变量以查看包含的值是否与其中一个元素匹配，以及相应的＆＃34;字段值＆＃34;被发送到$sum进行积累。

此外，删除$project，因为这通常由$group阶段处理，并且将其留在那里会导致处理开销，因为需要首先遍历管道中的每个文档。所以它并没有真正优化任何东西，而是让你付出代价。

顺便说一句。到目前为止，来自管道输出的样本数据缺少结束＆＃34;大括号＆＃34;。我在下面使用了这些数据（当然没有$ unwind在管道中）

 { "chr" : "22", "pos" : 14373, "ref" : "C", "alt" : "T", "samples" : { "GT" : "0|0", "GQ" : 48, "DP" : 1, "HQ" : [ 34, 1 ], "GTC" : 0, "sample_id" : "559de1b2aa43f47656b2a3fa"} },
 { "chr" : "22", "pos" : 14373, "ref" : "C", "alt" : "T", "samples" : { "GT" : "1|0", "GQ" : 15, "DP" : 8, "HQ" : [ 5, 51 ], "GTC" : 1, "sample_id" : "559de1b2aa43f47656b2a3f9"}},
 { "chr" : "22", "pos" : 14373, "ref" : "C", "alt" : "T", "samples" : { "GT" : "1/1", "GQ" : 43, "DP" : 5, "HQ" : [ 0, 2 ], "GTC" : 2, "sample_id" : "559de1b2aa43f"}},
 { "chr" : "20", "pos" : 14371, "ref" : "A", "alt" : "G", "samples" : { "GT" : "0|0", "GQ" : 48, "DP" : 1, "HQ" : [ 51, 51 ], "GTC" : 0, "sample_id" : "559de1b2aa43f47656b2a3fa"}},
 { "chr" : "20", "pos" : 14371, "ref" : "A", "alt" : "G", "samples" : { "GT" : "1|0", "GQ" : 48, "DP" : 8, "HQ" : [ 51, 51 ], "GTC" : 1, "sample_id" : "559de1b2aa43f47656b2a3f9"}}

使用Aggregation将字段与外部数组值匹配

1 个答案: