Question

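I recently started testing MongoDB via the shell and via PyMongo. I've noticed that returning a cursor and then iterating over it seems to be the bottleneck in the actual iteration. Is there a way to return more than one document during iteration?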

Pseudocode:

for line in file:
    value = line[a:b]
    cursor = collection.find({"field": value})
    for entry in cursor:
        (deal with single entry each time)

What I'm hoping to do is something like this:

for line in file:
    value = line[a:b]
    cursor = collection.find({"field": value})
    for all_entries in cursor:
        (deal with all entries at once rather than iterate each time)

I've tried using batch_size() as per this question and raising the value all the way up to 1000000, but it doesn't seem to have any effect (or I'm doing it wrong).

Any help is greatly appreciated. Please go easy on this Mongo newbie!

--- Edit ---

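Thank you, Caleb. I think you've pointed out what I was really trying to ask, which is this: is there any way to do a collection.findAll() or cursor.fetchAll() sort of command, as there is with the cx_Oracle module? The problem isn't storing the data, but retrieving it from MongoDB as fast as possible.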

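As far as I can tell, the speed at which the data is returned to me is dictated by my network, since Mongo has to fetch each record one at a time. Is that correct?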

Answer 1

Have you considered an approach like this:

for line in file:
  value = line[a:b]
  cursor = collection.find({"field": value})
  entries = list(cursor)  # or pull them out with a loop or comprehension -- just get all the docs
  # then process entries as a list, either singly or in batch

Or, for example:

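entries = {}
for line in file:
  value = line[a:b]
  # list() drains the cursor, so the documents are held in memory
  entries[value] = list(collection.find({"field": value}))
# after the loop, every cursor has been exhausted
for value in entries:
  # process entries[value], either singly or in batch
  pass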

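Basically, as long as you have enough RAM to hold your result sets, you should be able to pull them off the cursors and hold onto them before processing. This isn't likely to be significantly faster, but it will mitigate any slowdown caused specifically by the cursors, and it frees you up to process your data in parallel if you're set up for that.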
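
As a rough sketch of the idea above, the helper below pulls every matching document into a list, keyed by the value that was looked up. The function name fetch_entries_by_value is made up for illustration, and `collection` is only assumed to behave like a PyMongo collection (a find(filter) method that returns an iterable of documents).

```python
def fetch_entries_by_value(collection, values):
    """Materialize the matching documents for each value into a list.

    `collection` is assumed to expose find(filter) returning an iterable
    of documents, as a PyMongo collection does.
    """
    entries = {}
    for value in values:
        if value in entries:
            continue  # this value was already fetched; skip the repeat query
        # list() drains the cursor, so nothing is left to fetch lazily later
        entries[value] = list(collection.find({"field": value}))
    return entries
```

With a real collection this could be called as fetch_entries_by_value(collection, (line[a:b] for line in file)), after which each entries[value] is a plain list that can be processed singly or in batch.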

Answer 2

You could also try:

results = list(collection.find({'field':value}))

This should load everything into RAM.

Or perhaps, if your file is not too large:

values = list()
for line in file:
    values.append(line[a:b])
results = list(collection.find({'field': {'$in': values}}))
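
One thing the single $in query loses is which documents belong to which value. The sketch below groups the results back by "field" and splits the value list into chunks so that no single query grows too large. The name fetch_grouped and the chunk_size default are assumptions for illustration, and "field" is assumed to hold a scalar (hashable) value.

```python
from collections import defaultdict


def fetch_grouped(collection, values, chunk_size=1000):
    """Fetch documents for many values with few queries, grouped by "field".

    `collection` is assumed to behave like a PyMongo collection.
    chunk_size is an arbitrary illustrative default, not a MongoDB limit.
    """
    unique_values = list(dict.fromkeys(values))  # de-duplicate, keep order
    grouped = defaultdict(list)
    for start in range(0, len(unique_values), chunk_size):
        chunk = unique_values[start:start + chunk_size]
        # one round of queries per chunk instead of one per value
        for doc in collection.find({"field": {"$in": chunk}}):
            grouped[doc["field"]].append(doc)
    return dict(grouped)
```

Values with no matching documents simply do not appear in the returned dict, so use .get(value, []) when looking them up.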

Answer 3

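toArray() might be a solution. According to the documentation, it first iterates the cursor completely on the Mongo side and then returns the results only once, in the form of an array.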

http://docs.mongodb.org/manual/reference/method/cursor.toArray/

As I understand it, this is unlike list(coll.find()) or [doc for doc in coll.find()], which fetch one document into Python at a time and then go back to Mongo to fetch the next one.

However, this method is not implemented in PyMongo... odd.

Answer 4

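As @jmelesky said above, I always follow the same approach. Here is my sample code. My cursor is stored in twts_result, and the list declared below is what I copy it into. Use RAM if you can afford to store the data there. This also solves the cursor timeout problem, provided you don't need to process and update the same collection you fetched the data from.

Here I am fetching tweets from a collection.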

twts_result = maindb.economy_geolocation.find({}, {'_id': False})
# Cursor.count() was removed in PyMongo 4; count_documents() replaces it
print("Tweets for processing -> %d" % maindb.economy_geolocation.count_documents({}))

tweets_sentiment = []
batch_tweets = []
# Copy the cursor data into a list
tweets_collection = list(twts_result)
for twt in tweets_collection:
    # do stuff here with twt data
    pass
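
If the whole result set does not fit in RAM, a middle ground is to process it a batch at a time. The sketch below is not from the answer above; iter_batches is a hypothetical helper that works on any iterable, so it should also accept a PyMongo cursor.

```python
from itertools import islice


def iter_batches(cursor, batch_len):
    """Yield lists of up to batch_len documents from any iterable."""
    iterator = iter(cursor)
    while True:
        # islice takes at most batch_len items without consuming more
        batch = list(islice(iterator, batch_len))
        if not batch:
            return  # the iterable is exhausted
        yield batch
```

For example, `for batch in iter_batches(twts_result, 500):` would hand the loop body a list of up to 500 tweets at a time, where 500 is an arbitrary size chosen for illustration.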

PyMongo - cursor iteration
