Question

I have to upsert these records in MongoDB. I used a simple piece of logic, but it did not work. Please help me fix this.

from pymongo import MongoClient
import json
import sys
import os
client = MongoClient('localhost', 9000)
db1 = client['Com_Crawl']
collection1 = db1['All']
posts1 = collection1.posts
ll=[]
f=file(sys.argv[1],'r')
for i in f:
    j=json.loads(i)
    ll.append(j)
#print ll
print len(ll)
count = 0
for l in ll:
    count = count+1
    if count <= 10000:
        print count,l
        print posts1.update({'vtid':l},{'$set': {'processed': 0}},upsert = True,multi = True)
print "**** Success ***"

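The file contains 10 million records. The code above adds a new field and sets its value to 0 for the first 10000 records. How can I process the remaining records in batches of 10000 per execution?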

Answer 1

You can do something like this.

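count = 0
while True:
    # Fetch the next page of 10000 documents that are already in the collection.
    batch = list(posts1.find({}, {'_id': 1}).sort('_id', 1).skip(count * 10000).limit(10000))
    if not batch:
        break
    for post in batch:
        print posts1.update({'_id': post['_id']}, {'$set': {'processed': 0}})
    count += 1
print "**** Success ***"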

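skip() does exactly what its name suggests: it skips that many entries in the query result, and limit() then caps the result at 10000. So basically you are using count to fetch the entries starting at 0, 10000, 20000, and so on, and limit() returns only the 10000 entries after that starting point.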
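To make the windowing concrete, here is a small sketch in plain Python 3 that mimics skip()/limit() with list slicing. It does not talk to MongoDB; `page_bounds` and the sample list are made up for illustration and are not part of PyMongo.

```python
def page_bounds(page, page_size=10000):
    """Return the (skip, limit) pair for a zero-based page number."""
    return page * page_size, page_size


# Stand-in for a query result with 25000 entries (hypothetical data).
records = list(range(25000))

for page in range(3):
    skip, limit = page_bounds(page)
    # Slicing plays the role of .skip(skip).limit(limit) on a cursor.
    window = records[skip:skip + limit]
    print(page, skip, len(window))
```

The last page is shorter than the others because only 5000 entries remain after skipping 20000, which is also how a loop over pages knows when it is done.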

Answer 2

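MongoDB has bulk update operations that update the database in batches. You can queue any number of operations and send them in one go, although internally they are executed in groups of 1000. The MongoDB documentation explains ordered and unordered bulk operations, bulk updates, and how bulk operations work. So, if you go with bulk updates, the code becomes: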

from pymongo import MongoClient
import json
import sys

client = MongoClient('localhost', 9000)
db1 = client['Com_Crawl']
collection1 = db1['All']
posts1 = collection1.posts
bulk = posts1.initialize_unordered_bulk_op()
ll = []
f = file(sys.argv[1], 'r')
for i in f:
    j = json.loads(i)
    ll.append(j)
print len(ll)
for index, l in enumerate(ll):
    # Queue an upsert; update() applies to every document matching the filter.
    bulk.find({'vtid': l}).upsert().update({'$set': {'processed': 0}})
    if (index + 1) % 10000 == 0:
        print bulk.execute()  # send this batch of 10000 and print the status
        bulk = posts1.initialize_unordered_bulk_op()  # start a new batch
if len(ll) % 10000 != 0:
    print bulk.execute()  # send the remaining records; execute() fails on an empty batch

As Joe D pointed out, you can also skip records and update in batches.
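The snippets in this thread are Python 2 and use `update()` and `initialize_unordered_bulk_op()`, both of which were removed in PyMongo 4. As a rough sketch for newer setups, the following Python 3 code streams the file in batches of 10000 instead of loading all 10 million records into a list first. It assumes one JSON value per line, as in the question; `iter_batches` and `upsert_batches` are hypothetical helper names, and `collection` is assumed to be a PyMongo 3 or later collection object.

```python
import json
from itertools import islice


def iter_batches(lines, batch_size=10000):
    """Yield lists of parsed JSON values, at most batch_size per list."""
    # Parse lazily and ignore blank lines so the whole file is never in memory.
    records = (json.loads(line) for line in lines if line.strip())
    while True:
        batch = list(islice(records, batch_size))
        if not batch:
            return
        yield batch


def upsert_batches(collection, lines, batch_size=10000):
    """Upsert every record batch by batch; return the number of records handled."""
    total = 0
    for batch in iter_batches(lines, batch_size):
        for vtid in batch:
            # update_many(..., upsert=True) is the PyMongo 3+ counterpart of
            # update(..., upsert=True, multi=True) used earlier in this thread.
            collection.update_many(
                {'vtid': vtid}, {'$set': {'processed': 0}}, upsert=True
            )
        total += len(batch)
        print('processed', total)
    return total


# Usage sketch (needs a running MongoDB server; not executed here):
#   import sys
#   from pymongo import MongoClient
#   posts = MongoClient('localhost', 9000)['Com_Crawl']['All'].posts
#   with open(sys.argv[1]) as f:
#       upsert_batches(posts, f)
```

This version sends one request per record, so the batches mainly provide progress checkpoints. To cut round trips, each batch could instead be turned into a list of `pymongo.UpdateMany` operations and passed to `collection.bulk_write(ops, ordered=False)`.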

Mongo DB, Python: Upsert for every 10000 records
