I did some diagnostics, and htop shows that python save_to_db.py is using 86% of the CPU while postgres: mydb mydb localhost idle in transaction is using 16%. The code of save_to_db.py looks roughly like this:
import datetime
import django
import os
import sys
import json
import itertools
import cProfile

# setting up standalone django environment
...

from django.db import transaction
from xxx.models import File

INPUT_FILE = "xxx"

with open("xxx", "r") as f:
    volume_name = f.read().strip()

def todate(seconds):
    return datetime.datetime.fromtimestamp(seconds)

@transaction.atomic
def batch_save_files(files, volume_name):
    for jf in files:
        metadata = files[jf]
        f = File(xxx=jf, yyy=todate(metadata[0]), zzz=todate(metadata[1]), vvv=metadata[2], www=volume_name)
        f.save()

with open(INPUT_FILE, "r") as f:
    dirdump = json.load(f)

timestamp = dirdump["curtime"]
files = {k: dirdump["files"][k] for k in list(dirdump["files"].keys())[:1000000]}

cProfile.run('batch_save_files(files, volume_name)')
The corresponding cProfile dump (I kept only the entries with a large cumtime):
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 881.336 881.336 <string>:1(<module>)
1000000 5.325 0.000 844.553 0.001 base.py:655(save)
1000000 14.574 0.000 834.125 0.001 base.py:732(save_base)
1000000 10.108 0.000 800.494 0.001 base.py:795(_save_table)
1000000 5.265 0.000 720.608 0.001 base.py:847(_do_update)
1000000 4.522 0.000 446.781 0.000 compiler.py:1038(execute_sql)
1000000 23.669 0.000 196.273 0.000 compiler.py:1314(as_sql)
1000000 7.473 0.000 458.064 0.000 compiler.py:1371(execute_sql)
1 0.000 0.000 881.336 881.336 contextlib.py:49(inner)
1000000 7.370 0.000 62.090 0.000 lookups.py:150(process_lhs)
1000000 3.907 0.000 81.685 0.000 lookups.py:159(as_sql)
1000000 3.251 0.000 44.679 0.000 lookups.py:74(process_lhs)
1000000 3.594 0.000 53.745 0.000 manager.py:81(manager_method)
1000000 19.855 0.000 106.487 0.000 query.py:1117(build_filter)
1000000 5.523 0.000 161.104 0.000 query.py:1241(add_q)
1000000 10.684 0.000 152.080 0.000 query.py:1258(_add_q)
1000000 7.448 0.000 513.984 0.001 query.py:697(_update)
1000000 2.221 0.000 201.359 0.000 query.py:831(filter)
1000000 5.371 0.000 199.138 0.000 query.py:845(_filter_or_exclude)
1 7.982 7.982 881.329 881.329 save_to_db.py:47(batch_save_files)
1000000 1.834 0.000 204.064 0.000 utils.py:67(execute)
1000000 3.099 0.000 202.231 0.000 utils.py:73(_execute_with_wrappers)
1000000 4.306 0.000 199.131 0.000 utils.py:79(_execute)
1000000 10.830 0.000 222.880 0.000 utils.py:97(execute)
2/1 0.000 0.000 881.336 881.336 {built-in method builtins.exec}
1000001 189.750 0.000 193.764 0.000 {method 'execute' of 'psycopg2.extensions.cursor' objects}
Running time python save_to_db.py takes 14 minutes, which is roughly 1,000 inserts per second. That is slow.
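For reference, only about 190 of the 881 seconds in the dump are spent inside psycopg2's cursor.execute; the rest is per-object ORM overhead on the save() path. To inspect exactly what each save() sends to Postgres (with a pre-assigned primary key, Django may issue an UPDATE before any INSERT), a small sketch like the one below can help; it assumes DEBUG=True so Django records queries, and "some-new-key" is just a placeholder value:

from django.db import connection, reset_queries

# Hedged diagnostic sketch: print the SQL statements issued by a single save().
# Query logging only happens with DEBUG=True (or connection.force_debug_cursor
# set), and this does write one throwaway row to the table.
reset_queries()
File(xxx="some-new-key", yyy=todate(0), zzz=todate(0),
     vvv=0, www=volume_name).save()
for q in connection.queries:
    print(q["time"], q["sql"])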
The schema of File is as follows:
xxx TEXT UNIQUE NOT NULL PRIMARY KEY
yyy DATETIME
zzz DATETIME
vvv INTEGER
www TEXT
I can't seem to figure out how to speed this up. Is there some way to do this that I'm not aware of? Currently I index everything, but I would be very surprised if that were the main bottleneck.
Thanks!
Answer 0 (score: 4):
You can use bulk_create.
objs = [
    File(
        xxx=jf,
        yyy=todate(metadata[0]),
        zzz=todate(metadata[1]),
        vvv=metadata[2],
        www=volume_name,
    )
    for jf, metadata in files.items()
]
filelist = File.objects.bulk_create(objs)
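With a million objects it may be worth splitting the work rather than building one giant list and statement; bulk_create accepts a batch_size argument for this. A sketch under the assumption that chunking is wanted (CHUNK is an arbitrary illustrative value, not a tuned one):

CHUNK = 10000

# Hedged sketch: create objects chunk by chunk; batch_size further splits each
# call into smaller INSERT statements on the database side.
items = list(files.items())
for start in range(0, len(items), CHUNK):
    File.objects.bulk_create(
        [
            File(
                xxx=jf,
                yyy=todate(metadata[0]),
                zzz=todate(metadata[1]),
                vvv=metadata[2],
                www=volume_name,
            )
            for jf, metadata in items[start:start + CHUNK]
        ],
        batch_size=1000,
    )

Note that bulk_create inserts rows in one query per batch, so it does not call each object's save() method and does not send pre_save/post_save signals.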