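I have the following documents in my collection. Each document contains the text of a tweet and an array of entities extracted from that tweet (using AWS Comprehend):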
{
"text" : "some tweet by John Smith in New York about Stack Overflow",
"entities" : [
{
"Type" : "ORGANIZATION",
"Text" : "stack overflow"
},
{
"Type" : "LOCATION",
"Text" : "new york"
},
{
"Type" : "PERSON",
"Text" : "john smith"
}
]
},
{
"text" : "another tweet by John Smith but this one from California and about Google",
"entities" : [
{
"Type" : "ORGANIZATION",
"Text" : "google"
},
{
"Type" : "LOCATION",
"Text" : "california"
},
{
"Type" : "PERSON",
"Text" : "john smith"
}
]
}
I want to get a list of the distinct entities.Text values, grouped by entities.Type, together with the number of occurrences of each entities.Text, like this:
{ "_id" : "ORGANIZATION", "values" : [ {text:"stack overflow",count:1},{text:"google",count:1} ] }
{ "_id" : "LOCATION", "values" : [ {text:"new york",count:1},{text:"california",count:1} ] }
{ "_id" : "PERSON", "values" : [ {text:"john smith",count:2} ] }
I can group by entities.Type and push every entities.Text into an array with this query:
db.collection.aggregate([
{
$unwind: '$entities'
},
{
$group: {
_id: '$entities.Type',
values: {
$push: '$entities.Text'
}
}
}])
This produces the following output, which contains duplicate values and no counts:
{ "_id" : "ORGANIZATION", "values" : [ "stack overflow", "google" ] }
{ "_id" : "LOCATION", "values" : [ "new york", "california" ] }
{ "_id" : "PERSON", "values" : [ "john smith", "john smith" ] }
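I started down the path of using $project as the last stage of the aggregation and adding a computed field, valuesMap, built with a JavaScript function. But then I realized that you can't write JavaScript in the aggregation pipeline.
My fallback is to post-process the MongoDB output with plain JavaScript, but I would like (for the sake of learning) to do all of this in a MongoDB query.
Thanks!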
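Answer 0 (score: 4)
You can try the query below. You need an additional $group stage: the first one counts each distinct (Type, Text) pair, and the second one regroups by type and pushes the text together with its count.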
db.collection.aggregate(
[
{"$unwind":"$entities"},
{"$group":{
"_id":{"type":"$entities.Type","text":"$entities.Text"},
"count":{"$sum":1}
}},
{"$group":{
"_id":"$_id.type",
"values":{"$push":{"text":"$_id.text","count":"$count"}}
}}
])
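To make the two $group stages easier to follow, here is a rough plain-JavaScript sketch of what the pipeline above computes. It runs on an in-memory copy of the two sample documents from the question and does not talk to MongoDB. The variable names (docs, unwound, pairCounts, byType, result) are made up for illustration only.

```javascript
// Sample data copied from the question (in-memory, no database involved).
const docs = [
  {
    text: "some tweet by John Smith in New York about Stack Overflow",
    entities: [
      { Type: "ORGANIZATION", Text: "stack overflow" },
      { Type: "LOCATION", Text: "new york" },
      { Type: "PERSON", Text: "john smith" }
    ]
  },
  {
    text: "another tweet by John Smith but this one from California and about Google",
    entities: [
      { Type: "ORGANIZATION", Text: "google" },
      { Type: "LOCATION", Text: "california" },
      { Type: "PERSON", Text: "john smith" }
    ]
  }
];

// Rough equivalent of $unwind: one record per entity.
const unwound = docs.flatMap(doc => doc.entities);

// Rough equivalent of the first $group: count each distinct (Type, Text) pair.
const pairCounts = new Map();
for (const { Type, Text } of unwound) {
  const key = JSON.stringify([Type, Text]); // compound key, like _id: {type, text}
  pairCounts.set(key, (pairCounts.get(key) || 0) + 1);
}

// Rough equivalent of the second $group: regroup by Type and push {text, count}.
const byType = new Map();
for (const [key, count] of pairCounts) {
  const [type, text] = JSON.parse(key);
  if (!byType.has(type)) byType.set(type, []);
  byType.get(type).push({ text, count });
}

const result = [...byType].map(([_id, values]) => ({ _id, values }));
console.log(JSON.stringify(result, null, 2));
```

In this sketch the PERSON group ends up with a single entry, john smith with count 2, which is the shape asked for in the question. The ordering here follows insertion order; the real pipeline does not promise any particular order unless a $sort stage is added.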
Answer 1 (score: 0)
db.collection.aggregate(
// Pipeline
[
// Stage 1
{
$unwind: {
path: '$entities'
}
},
// Stage 2
{
$group: {
_id: {
Text: '$entities.Text'
},
count: {
$sum: 1
},
Type: {
$addToSet: '$entities.Type'
}
}
},
// Stage 3
{
$group: {
_id: {
Type: '$Type'
},
values: {
$addToSet: {
text: '$_id.Text',
count: '$count'
}
}
}
},
// Stage 4
{
$project: {
values: 1,
_id: {
$arrayElemAt: ['$_id.Type', 0]
}
}
}
]
);
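One thing to be aware of with this pipeline: Stage 2 groups on Text alone and collects the types into a set, so a text that appears under more than one type is counted once across all of its types. The plain-JavaScript sketch below imitates Stage 2 on hypothetical data (the "amazon" entities are invented for illustration and are not from the question) to show the effect. A Set stands in for $addToSet here, and the element order in the real accumulator is not guaranteed.

```javascript
// Hypothetical entities (not from the question): "amazon" is tagged with two types.
const sampleEntities = [
  { Type: "ORGANIZATION", Text: "amazon" },
  { Type: "LOCATION", Text: "amazon" },
  { Type: "ORGANIZATION", Text: "google" }
];

// Imitates Stage 2: group by Text only, count, and collect the distinct Types.
const byText = new Map();
for (const { Type, Text } of sampleEntities) {
  if (!byText.has(Text)) byText.set(Text, { count: 0, types: new Set() });
  const group = byText.get(Text);
  group.count += 1;      // like count: {$sum: 1}
  group.types.add(Type); // like Type: {$addToSet: '$entities.Type'}
}

const stage2 = [...byText].map(([text, g]) => ({
  text,
  count: g.count,
  types: [...g.types]
}));
console.log(JSON.stringify(stage2));
```

Here "amazon" comes out with count 2 and both types attached, so the later stages cannot report one count per type. For the sample documents in the question every text has exactly one type, so the result matches the expected output; grouping on the (Type, Text) pair, as in Answer 0, avoids the issue in the general case.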