How do I query this dataset with a PyMongo aggregation?

Date: 2015-11-10 22:01:16

Tags: mongodb aggregation-framework

This is my first time using MongoDB aggregation queries. My dataset looks like this:

{ // doc 1
    "_id" : ObjectId("55f2481bc9b4cd1c0c198c9f"),
    "channels" : [ 
        "channel_3", 
        "channel_2", 
        "channel_1", 
        "channel_4"
    ],
    "msd" : 25,
    "uid" : "000012bb-2e5a-8bd3-d36a-fa037973e632"
}
{ // doc 2
    "_id" : ObjectId("55f2481bc9b4cd123452345f"),
    "channels" : [ 
        "channel_3", 
        "channel_4"
    ],
    "msd" : 50,
    "uid" : "000012bb-2e5a-8bd3-d36a-fa037973e632"
}
{ // doc 3
    "_id" : ObjectId("55f2481bc9b4cd1c0c198c9f"),
    "channels" : [  
        "channel_2"
    ],
    "msd" : 100,
    "uid" : "000012bb-2e5a-8bd3-d36a-fa037973e632"
}
{ // doc 4
    "_id" : ObjectId("55f2481bc9b4cd1c0c198c9f"),
    "channels" : [  
        "channel_2"
    ],
    "msd" : 80,
    "uid" : "000012bb-2e5a-8bd3-d36a-fa037973e632"
}

I have built a compound index:

from pymongo import ASCENDING
userlog.create_index([('uid', ASCENDING), ('channels', ASCENDING)])

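Now, given a user and an array of channels, I want to retrieve the average msd over the documents that have at least one channel in the query channels. For example, the query is: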

{"uid" : "000012bb-2e5a-8bd3-d36a-fa037973e632", "channels" : ["channel_1", "channel_2"], }

The channels of doc 1 include "channel_1" and "channel_2", and the channels of docs 3 and 4 include "channel_2". So the expected return value is (25 + 100 + 80) / 3 = 68.33.
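The expected value can be checked in plain Python without a database. This is only a sketch of the arithmetic: the `docs` list copies the `channels` and `msd` fields from the four sample documents, and the names `query_channels`, `matching`, and `expected_average` are made up for illustration.

```python
# Sample data copied from the four documents above (only the fields needed here).
docs = [
    {"channels": ["channel_3", "channel_2", "channel_1", "channel_4"], "msd": 25},
    {"channels": ["channel_3", "channel_4"], "msd": 50},
    {"channels": ["channel_2"], "msd": 100},
    {"channels": ["channel_2"], "msd": 80},
]
query_channels = {"channel_1", "channel_2"}

# Keep the msd of every document that shares at least one channel with the query.
matching = [d["msd"] for d in docs if query_channels & set(d["channels"])]
expected_average = sum(matching) / len(matching)

print(matching, round(expected_average, 2))  # [25, 100, 80] 68.33
```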

====================== Attempt 1 ======================

CODE:

pipe=[ 
    {"$unwind":'$channels'},
    {"$match":{'uid':"000012bb-2e5a-8bd3-d36a-fa037973e632", 'channels':{'$in':channels}}},
    {"$group":{'_id': '$channels', 'averageMSD':{'$avg':'$msd'}}}
    ]

for res in db.aggregate(pipeline=pipe):
    print(res)

Result:

{'_id': 'channel_1', 'averageMSD': 25.0}
{'_id': 'channel_2', 'averageMSD': 68.33333333333333}

It seems that "$unwind" causes doc 1 to be unexpectedly counted twice, once for each matching channel. Also, "$unwind" is very slow.
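To see where the two result rows come from, the stages of Attempt 1 can be imitated in plain Python. This is a rough simulation, not how MongoDB executes the pipeline; the `uid` filter is left out because all sample documents share the same uid, and the variable names are illustrative.

```python
from collections import defaultdict

docs = [
    {"channels": ["channel_3", "channel_2", "channel_1", "channel_4"], "msd": 25},
    {"channels": ["channel_3", "channel_4"], "msd": 50},
    {"channels": ["channel_2"], "msd": 100},
    {"channels": ["channel_2"], "msd": 80},
]
query_channels = ["channel_1", "channel_2"]

# $unwind: one row per (document, channel) pair.
rows = [(ch, d["msd"]) for d in docs for ch in d["channels"]]
# $match: keep rows whose single channel is in the query list.
rows = [(ch, msd) for ch, msd in rows if ch in query_channels]
# $group by channel: doc 1 lands in both groups because it survived twice.
groups = defaultdict(list)
for ch, msd in rows:
    groups[ch].append(msd)
averages = {ch: sum(v) / len(v) for ch, v in groups.items()}

print(averages)
```

Doc 1 contributes one row for "channel_1" and one for "channel_2", which is why it shows up in both per-channel averages instead of once in an overall average.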

====================== Attempt 2 ======================

CODE:

pipe=[ 
    {"$match":{'uid':"000012bb-2e5a-8bd3-d36a-fa037973e632", 'channels':{'$in':channels}}},
    {"$group":{'_id': '$channels', 'averageMSD':{'$avg':'$msd'}}}
    ]

for res in db.aggregate(pipeline=pipe):
    print(res)

Result:

{'averageMSD': 90.0, '_id': ['channel_2']}
{'averageMSD': 25.0, '_id': ['channel_3', 'channel_2', 'channel_1', 'channel_4']}

The result is still not what I want. It seems I should not group the results by "channels", but I do not know how to fix it.
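The output of Attempt 2 is consistent with grouping on the whole array value: documents fall into the same group only when their "channels" arrays are identical. A plain-Python imitation of that behaviour, using tuples as stand-ins for the array keys (an illustrative choice, not MongoDB's internal representation):

```python
from collections import defaultdict

docs = [
    {"channels": ["channel_3", "channel_2", "channel_1", "channel_4"], "msd": 25},
    {"channels": ["channel_3", "channel_4"], "msd": 50},
    {"channels": ["channel_2"], "msd": 100},
    {"channels": ["channel_2"], "msd": 80},
]
query_channels = {"channel_1", "channel_2"}

# $match with $in: a document passes once if any of its channels is in the query.
matched = [d for d in docs if query_channels & set(d["channels"])]
# $group on '$channels': the key is the entire array, so docs 3 and 4 share a group.
groups = defaultdict(list)
for d in matched:
    groups[tuple(d["channels"])].append(d["msd"])
averages = {key: sum(v) / len(v) for key, v in groups.items()}

print(averages)
```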

How can I query the database efficiently with aggregation?

1 Answer:

Answer 0 (score: 0)

I figured it out:

pipe=[ 
    {"$match":{'uid':"000012bb-2e5a-8bd3-d36a-fa037973e632", 'channels':{'$in':channels}}},
    {"$group":{'_id': None, 'averageMSD':{'$avg':'$msd'}}}
    ]
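For reuse, the same pipeline could be wrapped in a small helper. This is a minimal sketch: `build_pipeline` is a hypothetical name, and the commented usage assumes `userlog` is the PyMongo collection the index was created on. Grouping with `_id` set to `None` puts every matched document into a single group, and `$in` on an array field matches a document once if any element is in the list, so no `$unwind` is needed.

```python
def build_pipeline(uid, channels):
    """Build a pipeline that averages msd over a user's matching documents.

    Hypothetical helper; the stage contents mirror the answer above.
    """
    return [
        # A document passes once if any of its channels is in the query list.
        {"$match": {"uid": uid, "channels": {"$in": list(channels)}}},
        # _id None collapses all matched documents into one group.
        {"$group": {"_id": None, "averageMSD": {"$avg": "$msd"}}},
    ]


pipe = build_pipeline(
    "000012bb-2e5a-8bd3-d36a-fa037973e632", ["channel_1", "channel_2"]
)

# Assumed usage with a PyMongo collection named `userlog`:
# for res in userlog.aggregate(pipe):
#     print(res["averageMSD"])
```

On the sample data this should give the single average of about 68.33 that the question asks for. Because `$match` is the first stage, the server can also use the compound index on `uid` and `channels` to select the documents.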