I am upserting records from a large csv.gzip file into MongoDB. To read the file in chunks, I use the chunk generator described in this answer:
def gen_chunks(reader, chunksize=100):
    """
    Chunk generator. Take a CSV `reader` and yield
    `chunksize` sized slices.
    """
    chunk = []
    for i, line in enumerate(reader):
        if (i % chunksize == 0 and i > 0):
            yield chunk
            del chunk[:]
        chunk.append(line)
    yield chunk
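For reference, the generator can be sanity-checked on any iterable; a minimal sketch (the input and chunk size below are made up for illustration):

# Each yielded chunk is the same reused list, so process it before the next
# iteration (here it is just printed).
for chunk in gen_chunks(range(10), chunksize=4):
    print(list(chunk))
# [0, 1, 2, 3]
# [4, 5, 6, 7]
# [8, 9]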
I run the mongod daemon with the following command:
$ mongod --dbpath data\db
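(For reference, if the storage engine is the default WiredTiger, its internal cache can also be capped at startup; an untested sketch with a made-up 2 GB limit:)

$ mongod --dbpath data\db --wiredTigerCacheSizeGB 2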
and then start the Python script that uses pymongo:
import csv
import gzip

# Assumes `filepath` (path to the csv.gz file) and `locations`
# (a pymongo collection) are defined earlier in the script.
with gzip.open(filepath, 'rt', newline='') as gzip_file:
    dr = csv.DictReader(gzip_file)  # comma is the default delimiter
    chunksize = 10 ** 3
    for chunk in gen_chunks(dr, chunksize):
        bulk = locations.initialize_ordered_bulk_op()
        for row in chunk:
            cell = {
                'mcc': int(row['mcc']),
                'mnc': int(row['net']),
                'lac': int(row['area']),
                'cell': int(row['cell'])
            }
            location = {
                'lat': float(row['lat']),
                'lon': float(row['lon'])
            }
            bulk.find(cell).upsert().update({'$set': {'OpenCellID': location}})
        result = bulk.execute()
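For comparison, the loop inside the `with` block could also be written against the newer bulk_write API (a sketch, assuming `initialize_ordered_bulk_op` is the older, since-deprecated bulk interface and that `locations`, `dr`, `chunksize` and `gen_chunks` are as above):

from pymongo import UpdateOne

for chunk in gen_chunks(dr, chunksize):
    # Build one UpdateOne request per row and send the whole chunk at once.
    requests = []
    for row in chunk:
        cell = {
            'mcc': int(row['mcc']),
            'mnc': int(row['net']),
            'lac': int(row['area']),
            'cell': int(row['cell'])
        }
        location = {'lat': float(row['lat']), 'lon': float(row['lon'])}
        requests.append(UpdateOne(cell, {'$set': {'OpenCellID': location}}, upsert=True))
    result = locations.bulk_write(requests, ordered=True)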
The RAM used by the mongod process then keeps growing (sorry that the screenshot is in my native language; RAM is the third column). By the time the script finishes (after upserting about 30 million documents), mongod is using roughly 15 GB of RAM!
What am I doing wrong or misunderstanding?
P.S. After restarting the daemon, the RAM usage drops back to a normal value (about 30 MB).