Question

I have the following code:

from pymongo import MongoClient

client = MongoClient()
data_base = client.hkpr_restore
agents_collection = data_base.agents
agent_ids = agents_collection.find({}, {"_id": 1})

This gives me the following result:

{u'_id': ObjectId('553020a8bf2e4e7a438b46d9')}
{u'_id': ObjectId('553020a8bf2e4e7a438b46da')}
{u'_id': ObjectId('553020a8bf2e4e7a438b46db')}

How can I get the ObjectIds themselves, so that I can use each ID to search another collection?

Answer 1

Use distinct:

In [27]: agent_ids = agents_collection.find().distinct('_id')

In [28]: agent_ids
Out[28]: 
[ObjectId('553662940acf450bef638e6d'),
 ObjectId('553662940acf450bef638e6e'),
 ObjectId('553662940acf450bef638e6f')]

In [29]: agent_id2 = [str(oid) for oid in agents_collection.find().distinct('_id')]

In [30]: agent_id2
Out[30]: 
['553662940acf450bef638e6d',
 '553662940acf450bef638e6e',
 '553662940acf450bef638e6f']
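
For the original goal of searching another collection with each ID, the list returned by distinct can be passed to a single $in query instead of issuing one query per ID. A minimal sketch, assuming the second collection stores the reference in a field named agent_id (the helper name find_by_agent_ids, that field name, and the orders collection in the usage comment are all hypothetical, not from the question):

```python
def find_by_agent_ids(other_collection, agent_ids, field="agent_id"):
    """Return documents in other_collection whose `field` matches any id.

    `field` is an assumed name; use whatever key the second collection
    actually stores the agent's _id under.
    """
    # $in matches documents whose field equals any value in the list,
    # so one query replaces a per-id loop of separate lookups.
    return list(other_collection.find({field: {"$in": list(agent_ids)}}))


# Usage against a live server (collection name `orders` is hypothetical):
#   agent_ids = agents_collection.distinct("_id")
#   matches = find_by_agent_ids(data_base.orders, agent_ids)
```

Keep the ids as ObjectId values for this query; the str() versions would only match if the other collection stores them as strings.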

Answer 2

Try building the list of _ids with a list comprehension, like this:

>>> client = MongoClient()
>>> data_base = client.hkpr_restore
>>> agents_collection = data_base.agents
>>> result = agents_collection.find({},{"_id":1})
>>> agent_ids = [x["_id"] for x in result]
>>> print(agent_ids)
[ObjectId('553020a8bf2e4e7a438b46d9'), ObjectId('553020a8bf2e4e7a438b46da'), ObjectId('553020a8bf2e4e7a438b46db')]

Answer 3

I would like to add something more general than querying for all _ids.

import bson
[...]
results = agents_collection.find({})
# Collect every ObjectId-valued field from every document, not only _id.
objects = [v for result in results for v in result.values()
           if isinstance(v, bson.objectid.ObjectId)]

Context: saving objects in gridfs creates ObjectIds, and this helped me retrieve all of them for further queries.

Answer 4

I solved the problem by following this answer: add a hint to the find call, then simply iterate over the returned cursor.

db.c.find({}, {_id: 1}).hint({_id: 1});

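My guess is that without the hint, the cursor fetches whole documents as it iterates, which makes iteration very slow. With the hint, the cursor returns only the ObjectIds and iteration finishes quickly.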
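
The line above is mongo shell syntax. In pymongo the same query might look like the sketch below, which passes the index to Cursor.hint as a list of (field, direction) pairs. The helper name iter_object_ids is hypothetical, and whether the hint actually speeds things up will depend on the deployment:

```python
def iter_object_ids(collection):
    """Yield the _id of every document in `collection`, one at a time."""
    # Project only _id and hint the default _id index, mirroring
    # db.c.find({}, {_id: 1}).hint({_id: 1}) from the shell.
    cursor = collection.find({}, {"_id": 1}).hint([("_id", 1)])
    for doc in cursor:
        yield doc["_id"]


# Usage against a live server:
#   for object_id in iter_object_ids(agents_collection):
#       ...
```

Because this is a generator, the ids are consumed as the cursor streams them rather than being collected into one large list first.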

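For background, I am working on an ETL job that needs to sync one mongo collection to another while modifying the data according to certain conditions. There are about 100 million object IDs in total.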

I tried using distinct, but got the following error:

Error in : distinct too big, 16mb cap

I also tried aggregation with $group, as suggested in answers to other similar questions, only to run into memory consumption errors.
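
distinct returns its whole result in a single response, which is why it runs into the 16 MB cap at this scale, whereas a cursor streams its results. For a sync job like the one described above, one option is to consume the ids in fixed-size batches so the full list never has to sit in memory. A minimal sketch; the helper name batched_ids, the batch size, and the source collection in the usage comment are illustrative assumptions:

```python
from itertools import islice


def batched_ids(ids, batch_size=1000):
    """Group any iterable of ids into lists of at most batch_size items."""
    iterator = iter(ids)
    while True:
        # islice pulls the next batch_size items without reading further.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Usage with a live cursor (`source` is a hypothetical collection handle):
#   cursor = source.find({}, {"_id": 1})
#   for batch in batched_ids(doc["_id"] for doc in cursor):
#       docs = source.find({"_id": {"$in": batch}})
#       ...transform docs and write them to the target collection...
```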

How do I get a list of ObjectIds using pymongo?
