I did some diagnostics, and htop shows that python save_to_db.py is using 86% of the CPU while postgres: mydb mydb localhost idle in transaction is using 16%. The code of save_to_db.py looks roughly like this:
import datetime
import django
import os
import sys
import json
import itertools
import cProfile

# setting up standalone django environment
...

from django.db import transaction
from xxx.models import File

INPUT_FILE = "xxx"

with open("xxx", "r") as f:
    volume_name = f.read().strip()

def todate(seconds):
    return datetime.datetime.fromtimestamp(seconds)

@transaction.atomic
def batch_save_files(files, volume_name):
    for jf in files:
        metadata = files[jf]
        f = File(xxx=jf, yyy=todate(metadata[0]), zzz=todate(metadata[1]), vvv=metadata[2], www=volume_name)
        f.save()

with open(INPUT_FILE, "r") as f:
    dirdump = json.load(f)

timestamp = dirdump["curtime"]
files = {k: dirdump["files"][k] for k in list(dirdump["files"].keys())[:1000000]}

cProfile.run('batch_save_files(files, volume_name)')
The corresponding cProfile dump (I kept only the entries with a large cumtime):
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 881.336 881.336 <string>:1(<module>)
1000000 5.325 0.000 844.553 0.001 base.py:655(save)
1000000 14.574 0.000 834.125 0.001 base.py:732(save_base)
1000000 10.108 0.000 800.494 0.001 base.py:795(_save_table)
1000000 5.265 0.000 720.608 0.001 base.py:847(_do_update)
1000000 4.522 0.000 446.781 0.000 compiler.py:1038(execute_sql)
1000000 23.669 0.000 196.273 0.000 compiler.py:1314(as_sql)
1000000 7.473 0.000 458.064 0.000 compiler.py:1371(execute_sql)
1 0.000 0.000 881.336 881.336 contextlib.py:49(inner)
1000000 7.370 0.000 62.090 0.000 lookups.py:150(process_lhs)
1000000 3.907 0.000 81.685 0.000 lookups.py:159(as_sql)
1000000 3.251 0.000 44.679 0.000 lookups.py:74(process_lhs)
1000000 3.594 0.000 53.745 0.000 manager.py:81(manager_method)
1000000 19.855 0.000 106.487 0.000 query.py:1117(build_filter)
1000000 5.523 0.000 161.104 0.000 query.py:1241(add_q)
1000000 10.684 0.000 152.080 0.000 query.py:1258(_add_q)
1000000 7.448 0.000 513.984 0.001 query.py:697(_update)
1000000 2.221 0.000 201.359 0.000 query.py:831(filter)
1000000 5.371 0.000 199.138 0.000 query.py:845(_filter_or_exclude)
1 7.982 7.982 881.329 881.329 save_to_db.py:47(batch_save_files)
1000000 1.834 0.000 204.064 0.000 utils.py:67(execute)
1000000 3.099 0.000 202.231 0.000 utils.py:73(_execute_with_wrappers)
1000000 4.306 0.000 199.131 0.000 utils.py:79(_execute)
1000000 10.830 0.000 222.880 0.000 utils.py:97(execute)
2/1 0.000 0.000 881.336 881.336 {built-in method builtins.exec}
1000001 189.750 0.000 193.764 0.000 {method 'execute' of 'psycopg2.extensions.cursor' objects}
Running time python save_to_db.py takes 14 minutes, which is roughly 1,000 inserts per second. That is slow.
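For reference, only about 190 of the 881 seconds in the dump are spent inside psycopg2's cursor.execute; the rest is per-object ORM overhead on the save() path. To inspect exactly what each save() sends to Postgres (with a pre-assigned primary key, Django may issue an UPDATE before any INSERT), a small sketch like the one below can help; it assumes DEBUG=True so Django records queries, and "some-new-key" is just a placeholder value:

from django.db import connection, reset_queries

# Hedged diagnostic sketch: print the SQL statements issued by a single save().
# Query logging only happens with DEBUG=True (or connection.force_debug_cursor
# set), and this does write one throwaway row to the table.
reset_queries()
File(xxx="some-new-key", yyy=todate(0), zzz=todate(0),
     vvv=0, www=volume_name).save()
for q in connection.queries:
    print(q["time"], q["sql"])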
The schema of File is as follows:
xxx TEXT UNIQUE NOT NULL PRIMARY KEY
yyy DATETIME
zzz DATETIME
vvv INTEGER
www TEXT
I can't seem to figure out how to speed this up. Is there some way to do this that I'm not aware of? Currently I index everything, but I would be very surprised if that were the main bottleneck.
Thanks!
Answer 0 (score: 4):
You can use bulk_create.
objs = [
    File(
        xxx=jf,
        yyy=todate(metadata[0]),
        zzz=todate(metadata[1]),
        vvv=metadata[2],
        www=volume_name,
    )
    for jf, metadata in files.items()
]
filelist = File.objects.bulk_create(objs)
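With a million objects it may be worth splitting the work rather than building one giant list and statement; bulk_create accepts a batch_size argument for this. A sketch under the assumption that chunking is wanted (CHUNK is an arbitrary illustrative value, not a tuned one):

CHUNK = 10000

# Hedged sketch: create objects chunk by chunk; batch_size further splits each
# call into smaller INSERT statements on the database side.
items = list(files.items())
for start in range(0, len(items), CHUNK):
    File.objects.bulk_create(
        [
            File(
                xxx=jf,
                yyy=todate(metadata[0]),
                zzz=todate(metadata[1]),
                vvv=metadata[2],
                www=volume_name,
            )
            for jf, metadata in items[start:start + CHUNK]
        ],
        batch_size=1000,
    )

Note that bulk_create inserts rows in one query per batch, so it does not call each object's save() method and does not send pre_save/post_save signals.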