Question

我有一个这样的字段集合：

{
    "_id":"5cf54857bbc85fd0ff5640ba",
    "book_id":"5cf172220fb516f706d00591",
    "tags":{
        "person":[
            {"start_match":209, "length_match":6, "word":"kimmel"}
        ],
        "organization":[
            {"start_match":107, "length_match":12, "word":"philadelphia"},
            {"start_match":209, "length_match":13, "word":"kimmel center"}
        ],
        "location":[
            {"start_match":107, "length_match":12, "word":"philadelphia"}
        ]
    },
    "deleted":false
}

我想收集类别中的不同单词并将其计数。因此，输出应如下所示：

{
    "response": [
        {
            "tag": "location",
            "tag_list": [
                {
                    "count": 31,
                    "phrase": "philadelphia"
                },
                {
                    "count": 15,
                    "phrase": "usa"
                }
             ]
        },
        {
            "tag": "organization",
            "tag_list": [ ... ]
        },
        {
            "tag": "person",
            "tag_list": [ ... ]
        },
    ]
}

这样的管道有效：

def pipeline_func(tag):
    return [
        {'$replaceRoot': {'newRoot': '$tags'}},
        {'$unwind': '${}'.format(tag)},
        {'$group': {'_id': '${}.word'.format(tag), 'count': {'$sum': 1}}},
        {'$project': {'phrase': '$_id', 'count': 1, '_id': 0}},
        {'$sort': {'count': -1}}
    ]

但是它要求每个标签。我想知道如何在一个请求中做到这一点。谢谢您的关注。

Answer 1

如前所述，由于$unwind只能仅用于数组，因此问题数据与当前声明的管道处理过程存在轻微不匹配，问题中显示的tags不是数组。

对于问题中显示的数据，您基本上需要这样的管道：

db.collection.aggregate([
  { "$addFields": {
    "tags": { "$objectToArray": "$tags" }
  }},
  { "$unwind": "$tags" },
  { "$unwind": "$tags.v" },
  { "$group": {
    "_id": {
      "tag": "$tags.k",
      "phrase": "$tags.v.word"
    },
    "count": { "$sum": 1 }
  }},
  { "$group": {
    "_id": "$_id.tag",
    "tag_list": {
      "$push": {
        "count": "$count",
        "phrase": "$_id.phrase"
      }
    }
  }}
])

再次根据注释，由于tags实际上是对象，因此根据其子键收集数据所需的实际条件正如问题所要解决的那样，就是将其本质上转变为项目的数组。

在您当前的管道中使用$replaceRoot似乎表明$objectToArray在这里合理使用，因为它可以从MongoDB 3.4的更高修补程序版本中获得，是您现在应该在生产中使用的最低版本。

实际上，$objectToArray所做的几乎与名称中所说的一样，并且产生了被分解为 key 的条目的数组（或“ list”，更像是 pythonic ）和 value 对。这些本质上是 object （或“ dict”条目）的“列表”，分别具有键k和v。在提供的文档中，第一个管道阶段的输出如下所示：

{
  "book_id": "5cf172220fb516f706d00591",
  "tags": [
    {
      "k": "person",
      "v": [
        {
          "start_match": 209,
          "length_match": 6,
          "word": "kimmel"
        }
      ]
    }, {
      "k": "organization",
      "v": [
        {
          "start_match": 107,
          "length_match": 12,
          "word": "philadelphia"
        }, {
          "start_match": 209,
          "length_match": 13,
          "word": "kimmel center"
        }
      ]
    }, {
      "k": "location",
      "v": [
        {
          "start_match": 107,
          "length_match": 12,
          "word": "philadelphia"
        }
      ]
    }
  ],
  "deleted" : false
}

因此，您应该能够看到如何轻松访问这些k值并将其用于分组，当然v是标准数组好。因此，只显示了两个 $unwind阶段，然后是两个 $group阶段。作为要收集键组合的第一个$group，第二个要按主分组键收集，同时将其他累加添加到“列表” < / em>在该条目中。

当然，上面列出的输出与您在问题中的要求完全不同，但数据基本上就在那里。您可以选择添加$addFields或$project阶段，以将_id键重命名为最终聚合阶段：

{ "$addFields": { "_id": "$$REMOVE", "tag": "$_id" }}

或者干脆在光标输出上做一些 pythonic 的操作：

cursor = db.collection.aggregate([ { "$addFields": { "tags": { "$objectToArray": "$tags" } }}, { "$unwind": "$tags" }, { "$unwind": "$tags.v" }, { "$group": { "_id": { "tag": "$tags.k", "phrase": "$tags.v.word" }, "count": { "$sum": 1 } }}, { "$group": { "_id": "$_id.tag", "tag_list": { "$push": { "count": "$count", "phrase": "$_id.phrase" } } }} ]) output = [{ 'tag': doc['_id'], 'tag_list': doc['tag_list'] } for doc in cursor] print({ 'response': output });

最终输出为“列表” ，您可以将其用于response：

{ "tag_list": [ { "count": 1, "phrase": "philadelphia" } ], "tag": "location" }, { "tag_list": [ { "count": 1, "phrase": "kimmel" } ], "tag": "person" }, { "tag_list": [ { "count": 1, "phrase": "kimmel center" }, { "count": 1, "phrase": "philadelphia" } ], "tag": "organization" }

请注意，使用 list comprehension 方法，您可以更好地控制“键”作为输出的顺序，因为MongoDB本身只需在追加新的键名即可投影，使现有键优先排序。如果那种事情对您很重要，那就是。尽管实际上不应该这样，因为不应将所有类似Object / Dict的结构都视为具有键的任何设置顺序。这就是数组（或列表）的用途。

如何通过对所有元素进行计数并按一个请求进行分组来进行pymongo聚合

1 个答案: