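In MongoDB, I have a collection of documents, each containing an array of records. I want to group the records by runs of identical tags while preserving their natural order.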
{
"day": "2019-01-07",
"records": [
{
"tag": "ch",
"unixTime": ISODate("2019-01-07T09:06:56Z"),
"score": 1
},
{
"tag": "u",
"unixTime": ISODate("2019-01-07T09:07:06Z"),
"score": 0
},
{
"tag": "ou",
"unixTime": ISODate("2019-01-07T09:07:06Z"),
"score": 0
},
{
"tag": "u",
"unixTime": ISODate("2019-01-07T09:07:20Z"),
"score": 0
},
{
"tag": "u",
"unixTime": ISODate("2019-01-07T09:07:37Z"),
"score": 1
}
]
}
I want to group (and aggregate) the records by consecutive sequences of the same tag, not simply by unique tag.
Desired output:
{
"day": "2019-01-07",
"records": [
{
"tag": "ch",
"unixTime": [ISODate("2019-01-07T09:06:56Z")],
"score": 1,
"nbRecords": 1
},
{
"tag": "u",
"unixTime": [ISODate("2019-01-07T09:07:06Z")],
"score": 0,
"nbRecords":1
},
{
"tag": "ou",
"unixTime": [ISODate("2019-01-07T09:07:06Z")],
"score": 0
},
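{
"tag": "u",
"unixTime": [ISODate("2019-01-07T09:07:20Z"), ISODate("2019-01-07T09:07:37Z")],
"score": 1,
"nbRecords": 2
}
]
}
It seems that MongoDB's `$group` aggregation stage groups by unique field values wherever they appear in the array, as if the array had been sorted first: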
db.coll.aggregate(
[
{"$unwind":"$records"},
{"$group":
{
"_id":{
"tag":"$records.tag",
"day":"$day"
},
...
}
}
]
)
returns
{
"day": "2019-01-07",
"records": [
{
"tag": "ch",
"unixTime": [ISODate("2019-01-07T09:06:56Z")],
"score": 1,
"nbRecords": 1
},
{
"tag": "u",
"unixTime": [ISODate("2019-01-07T09:07:06Z"),ISODate("2019-01-07T09:07:20Z"),ISODate("2019-01-07T09:07:37Z")],
"score": 2,
"nbRecords":3
},
{
"tag": "ou",
"unixTime": [ISODate("2019-01-07T09:07:06Z")],
"score": 0
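}
]
}
Since I am currently using the pymongo driver, I implemented the solution in Python, using itertools.groupby as a generator to perform a grouping that respects the natural order. However, because the processing takes so long, I am running into a server timeout problem (cursor.NotFound Error).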
Any ideas on how to use MongoDB's mapReduce functionality directly to do the equivalent of itertools.groupby() in Python?
Any help would be much appreciated: I am using pymongo driver 3.8 and MongoDB 4.0.
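Answer 0 (score: 0)
Ni! Walk through the records array and add a new integer index that is incremented every time the groupby target changes, then use mongo operations on that index.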
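A minimal sketch of the index idea above, in plain Python and without touching the database. The helper name `add_run_index` and the field name `run` are hypothetical, not something from the question, and the sample keeps only the `tag` and `score` fields:

```python
def add_run_index(records, key="tag"):
    """Return copies of the records with a 'run' counter added.

    The counter goes up by one each time the value under `key`
    differs from the previous record, so consecutive records with
    the same tag share one run number.
    """
    indexed = []
    run = -1
    previous = object()  # sentinel that never equals a real tag
    for record in records:
        if record[key] != previous:
            run += 1
            previous = record[key]
        indexed.append({**record, "run": run})  # copy, do not mutate input
    return indexed


# Tags and scores from the question's sample document
records = [
    {"tag": "ch", "score": 1},
    {"tag": "u", "score": 0},
    {"tag": "ou", "score": 0},
    {"tag": "u", "score": 0},
    {"tag": "u", "score": 1},
]
indexed = add_run_index(records)
print([r["run"] for r in indexed])  # [0, 1, 2, 3, 3]
```

With such a field stored on each record, a `$group` keyed on `day` and `records.run` rather than on `records.tag` should keep separate runs of the same tag apart. Since `$group` does not guarantee output order, a `$sort` on the run index would presumably be needed afterwards.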
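Answer 1 (score: 0)
Following @Ale's recommendation, and with no hints on how to do this inside MongoDB, I switched back to a Python implementation that solves the cursor.NotFound problem.
I suppose this could be done within MongoDB, but this works: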
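import itertools

# `db` is assumed to be an existing pymongo Database handle.
for r in db.coll.find():
    session = []
    # groupby only merges consecutive records, so the natural order is kept
    for tag, time_score in itertools.groupby(r["records"], key=lambda x: x["tag"]):
        time_score = list(time_score)
        session.append({
            "tag": tag,
            "start": time_score[0]["unixTime"],
            "end": time_score[-1]["unixTime"],
            "ca": sum(n["score"] for n in time_score),
            "nb_records": len(time_score),
        })
    # Replace the raw records with the per-run summaries
    db.coll.update_one(
        {"_id": r["_id"]},
        {
            "$unset": {"records": ""},
            "$set": {"sessions": session},
        },
    )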
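For checking the grouping logic without a server, the itertools.groupby step can be pulled out into a pure function. This is only a sketch: the function name `summarize_runs` is hypothetical, the output follows the shape asked for in the question (`unixTime` list, summed `score`, `nbRecords`) rather than the `start`/`end`/`ca` fields used above, and plain `datetime` objects stand in for the `ISODate` values.

```python
from datetime import datetime
from itertools import groupby


def summarize_runs(records):
    """Collapse each run of consecutive same-tag records into one summary."""
    summaries = []
    for tag, run in groupby(records, key=lambda r: r["tag"]):
        run = list(run)  # the group iterator can only be consumed once
        summaries.append({
            "tag": tag,
            "unixTime": [r["unixTime"] for r in run],
            "score": sum(r["score"] for r in run),
            "nbRecords": len(run),
        })
    return summaries


# The records array from the question's sample document
sample = [
    {"tag": "ch", "unixTime": datetime(2019, 1, 7, 9, 6, 56), "score": 1},
    {"tag": "u", "unixTime": datetime(2019, 1, 7, 9, 7, 6), "score": 0},
    {"tag": "ou", "unixTime": datetime(2019, 1, 7, 9, 7, 6), "score": 0},
    {"tag": "u", "unixTime": datetime(2019, 1, 7, 9, 7, 20), "score": 0},
    {"tag": "u", "unixTime": datetime(2019, 1, 7, 9, 7, 37), "score": 1},
]
result = summarize_runs(sample)
print([(s["tag"], s["score"], s["nbRecords"]) for s in result])
# [('ch', 1, 1), ('u', 0, 1), ('ou', 0, 1), ('u', 1, 2)]
```

The two "u" runs stay separate because groupby starts a new group whenever the key changes, which is the behaviour the question asks for.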