Question

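I'm trying to create a 200GB+ (single) database file (which will be regularly updated, occasionally partially recreated, and occasionally queried), so relatively large in my view. There are about 16k tables, ranging in size from a few kB to ~1GB, with 2 to 21 columns each. The longest table has nearly 15 million rows.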

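The script I wrote goes through the input files one by one, doing a bunch of processing and regex matching to get usable data. It regularly sends a batch (0.5-1GB) to be written with sqlite3, using one separate executemany statement for every table that data is inserted into. There are no commits, CREATE TABLE statements, or the like between these executemany statements, so I believe it all falls under a single transaction.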

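Initially the script ran fast enough for my purposes, but it slows down dramatically over time as it nears completion. That is unfortunate, given that I will need to slow it down further to keep memory use manageable during normal use of my laptop.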

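I did some quick benchmarking, comparing inserting identical sample data into an empty database versus into the 200GB database. The latter test was about 3 times slower at executing the insert statements (the relative slowdown of the commit was even worse, but in absolute terms it is insignificant); aside from that there was no significant difference.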
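A comparison of this kind could be sketched roughly as follows. This is not the original benchmark: the helper name `time_batch_insert`, the generic column names `c0, c1, ...`, and the table name passed in are all assumptions made for illustration. It times one `executemany` batch and the following commit separately, so the same rows can be run against an empty file and against the large one.

```python
import sqlite3
import time


def time_batch_insert(db_path, table, rows):
    """Return (insert_seconds, commit_seconds) for one executemany batch.

    Hypothetical helper: creates the table with generic column names if it
    does not exist yet, so it works on an empty database file as well.
    """
    width = len(rows[0])
    columns = ", ".join("c{0}".format(i) for i in range(width))
    placeholders = ", ".join("?" * width)
    conn = sqlite3.connect(db_path)
    try:
        conn.execute('CREATE TABLE IF NOT EXISTS "{0}" ({1})'.format(table, columns))
        conn.commit()
        start = time.perf_counter()
        conn.executemany(
            'INSERT INTO "{0}" VALUES ({1})'.format(table, placeholders), rows
        )
        inserted = time.perf_counter()
        conn.commit()
        committed = time.perf_counter()
    finally:
        conn.close()
    return inserted - start, committed - inserted
```

Calling it once with the path of a fresh file and once with the path of the full database, using the same `rows`, gives the two pairs of timings to compare.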

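When I researched this topic before, the search mostly returned results about indexes slowing down inserts on large tables. The answer seemed to be that inserts into tables without an index should stay at more or less the same speed regardless of size; since I don't need to run numerous queries against this database, I didn't create any indexes. I even double-checked by querying for indexes, which, if I have it right, should rule that out as a cause: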

c.execute("SELECT name FROM sqlite_master WHERE type = 'index'")

print(c.fetchone())  # returned None
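A slightly fuller version of the same check is sketched below; the function name `list_indexes` is made up for this example. `fetchone()` returning `None` already means the query matched no rows, but listing every row together with its table also shows the automatic indexes SQLite creates for `UNIQUE` and most `PRIMARY KEY` constraints, whose names start with `sqlite_autoindex_`.

```python
import sqlite3


def list_indexes(conn):
    """Return (index_name, table_name) pairs for every index in the database."""
    cur = conn.execute(
        "SELECT name, tbl_name FROM sqlite_master WHERE type = 'index'"
    )
    return cur.fetchall()
```

An empty list means there are no indexes of any kind, explicit or automatic.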

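Another issue that came up was transactions, but I don't see how that could be a problem only when writing to a large database, given the same script and the same data to be written.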
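Whether a batch really runs as one transaction can be checked with the connection's `in_transaction` attribute. The sketch below assumes the default transaction handling of Python's `sqlite3` module, where a transaction is opened implicitly before the first `INSERT`; the helper `insert_batches` and its `(table, rows)` input format are assumptions for illustration, not part of the original script.

```python
import sqlite3


def insert_batches(conn, batches):
    """Run one executemany per table and record the transaction state after each.

    `batches` is an iterable of (table_name, rows) pairs. The returned list
    holds conn.in_transaction as observed right after every executemany.
    """
    states = []
    for table, rows in batches:
        placeholders = ", ".join("?" * len(rows[0]))
        conn.executemany(
            'INSERT INTO "{0}" VALUES ({1})'.format(table, placeholders), rows
        )
        states.append(conn.in_transaction)
    conn.commit()  # single commit at the end, as in the script above
    return states
```

If every recorded state is `True` and `conn.in_transaction` is `False` only after the final commit, nothing committed in between.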

Abbreviated relevant code:

import functools
import sqlite3

# Process predefined objects and files, retrieve data in a batch:
# all fine, no slowdown on the full database.

conn = sqlite3.connect(db_path)

c = conn.cursor()

# List of tuples: (table name "subject-item", subject, item)
table_breakdown = [(tup[0] + '-' + tup[1], tup[0], tup[1]) for tup in all_tup]

# Creates a new table if needed for new subjects/items.
targeted_create_tables = functools.partial(create_tables, c)
list(map(targeted_create_tables, table_breakdown))  # no slowdown on the full database

# Inserts data for a specific subject/item combination.
targeted_insert_data = functools.partial(insert_data, c)
list(map(targeted_insert_data, table_breakdown))  # (3+) times slower

conn.commit()  # significant relative slowdown, but insignificant in absolute terms
conn.close()

And the relevant insert function:

def insert_data(c, tup):
    global collector   # holds the list of data tuples for each subject/item combination
    global sql_length  # predefined dictionary translating the item into the
                       # right-length "(?,?,?...)" placeholder string
    tbl_name = tup[0]
    subject = tup[1]
    item = tup[2]
    subject_data = collector[subject][item]
    if subject_data:

        statement = 'INSERT INTO "{0}" VALUES {1}'.format(tbl_name, sql_length[item])

        # Massively slower: about 80% of inserts are more than twice as slow.
        c.executemany(statement, subject_data)

        # Empty the batch stored in the collector.
        collector[subject][item] = []

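Edit: the table creation function, per CL's request. I know this is inefficient (checking whether a table name exists this way takes roughly as long as creating the table), but it is not significant to the slowdown.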

def create_tables(c, tup):
    global collector
    global title  # column schemes to match to items
    tbl_name = tup[0]
    subject = tup[1]
    item = tup[2]
    subject_data = collector[subject][item]

    if subject_data:
        # Look the table up by name; the name is passed as a bound parameter.
        c.execute("SELECT * FROM sqlite_master WHERE name = ? AND type = 'table'", (tbl_name,))
        if c.fetchone() is None:
            c.execute('CREATE TABLE "{0}" {1}'.format(tbl_name, title[item]))
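As an alternative to looking the name up in `sqlite_master` first, SQLite can do the existence check itself with `CREATE TABLE IF NOT EXISTS`. The sketch below is only an illustration: `ensure_table` is a hypothetical helper, and it makes no claim about being faster on a database of this size.

```python
import sqlite3


def ensure_table(c, tbl_name, column_scheme):
    """Create the table only if it is missing; a no-op when it already exists.

    `column_scheme` is a parenthesised column list such as "(TIMESTAMP TEXT, SP INTEGER)".
    Double quotes inside the table name are doubled so the quoted identifier stays valid.
    """
    quoted = '"{0}"'.format(tbl_name.replace('"', '""'))
    c.execute("CREATE TABLE IF NOT EXISTS {0} {1}".format(quoted, column_scheme))
```

Calling it repeatedly with the same name is safe, so it could be run for every subject/item pair in a batch without a separate lookup.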

There are 65 different column schemes in the title dict, but this is an example of what they look like:

title.append(('WINDFOR','(TIMESTAMP TEXT, SP INTEGER, SD TEXT, PUBLISHED TEXT, WIND_CAP NUMERIC, WIND_FOR NUMERIC)'))

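Does anyone have any ideas on where to look or what could be causing this? I apologize if I've left out important information or missed something very basic; I've come into this topic area completely cold.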

Answer 1

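Appending rows to the end of a table is the fastest way to insert data (and you are not playing games with the rowid, so you are indeed appending to the end).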

However, you are not using a single table but 16k tables, so the overhead of managing the table structures is multiplied.

Try increasing the cache size. But the most promising change would be to use fewer tables.
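Both suggestions might look roughly like this in Python. Everything named here is an assumption for illustration: the helper names, the 256 MiB figure in the usage note, and in particular the layout of one shared `WINDFOR` table with an added `SUBJECT` column, which is only one possible way to use fewer tables.

```python
import sqlite3

# Hypothetical consolidated layout: one table per column scheme instead of one
# per subject/item pair, with the subject stored as an extra leading column.
WINDFOR_COLUMNS = (
    "(SUBJECT TEXT, TIMESTAMP TEXT, SP INTEGER, SD TEXT, "
    "PUBLISHED TEXT, WIND_CAP NUMERIC, WIND_FOR NUMERIC)"
)


def connect_with_cache(db_path, cache_mib):
    """Open the database and request a page cache of about `cache_mib` MiB."""
    conn = sqlite3.connect(db_path)
    # A negative cache_size is interpreted as a size in KiB, not a page count.
    conn.execute("PRAGMA cache_size = {0}".format(-cache_mib * 1024))
    return conn


def insert_windfor(conn, subject, rows):
    """Append one subject's six-column WINDFOR rows to the shared table."""
    conn.execute("CREATE TABLE IF NOT EXISTS WINDFOR " + WINDFOR_COLUMNS)
    conn.executemany(
        "INSERT INTO WINDFOR VALUES (?, ?, ?, ?, ?, ?, ?)",
        [(subject,) + tuple(row) for row in rows],
    )
```

For example, `conn = connect_with_cache(db_path, 256)` followed by one `insert_windfor` call per subject and a single `conn.commit()`. The cache setting applies only to that connection, and with the shared table, reading one subject's data means filtering on `SUBJECT`.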

Answer 2

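It makes sense to me that the time to INSERT increases with the size of the database. The operating system itself may be slower when opening, closing, or writing to larger files. An index could of course slow things down much more, but that does not mean there would be no slowdown at all without one.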

Inserts slowing down gradually as the database grows (no indexes)
