Poor bulk insert performance with Python 3 and SQLite

Time: 2018-09-21 21:12:13

Tags: python sqlite

I have a few text files containing URLs. I am trying to create a SQLite database to store these URLs in a table. The URL table has two columns: a primary key (INTEGER) and the URL (TEXT).

I try to insert 100,000 entries in one insert command and loop until the URL list is exhausted. Basically, I read the contents of all the text files and save them in a list, then carve off smaller lists of 100,000 entries at a time and insert them into the table, as sketched below.
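
In outline, the approach looks something like this sketch (file names, table name, and chunk size here are illustrative, not the actual code from the repository linked below):

    import sqlite3

    # Sketch of the chunked-insert approach described above; file names,
    # table name and chunk size are illustrative.
    conn = sqlite3.connect('urls.db')
    conn.execute('CREATE TABLE IF NOT EXISTS blacklist (id INTEGER PRIMARY KEY, url TEXT)')

    urls = []
    for path in ('urls1.txt', 'urls2.txt'):  # the input text files
        with open(path, encoding='utf-8') as f:
            urls.extend(line.strip() for line in f)

    CHUNK = 100000
    for i in range(0, len(urls), CHUNK):
        # one executemany call per chunk of up to 100,000 entries
        conn.executemany('INSERT INTO blacklist (url) VALUES (?)',
                         ((u,) for u in urls[i:i + CHUNK]))
        conn.commit()
    conn.close()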

The total number of URLs in the text files is 4,591,415, and the total size of the text files is roughly 97.5 MB.

Problems:

  1. When I chose a file-backed database, the insert takes about 7 to 7.5 minutes. That does not feel like a fast insert, considering my solid-state drive has faster read/write. Besides that, Task Manager shows about 10 GB of RAM available. The processor is an i5-6300U at 2.4 GHz.

  2. The text files total about 97.5 MB, but after I insert the URLs into SQLite, the database is about 350 MB, i.e. roughly 3.5 times the size of the original data. Since the database contains no other tables, indexes, etc., this database size looks a bit odd.

For problem 1, I experimented with the parameters and settled on the best ones based on test runs with different combinations.

| Configuration | Time |
| --- | --- |
| 50,000 - with journal = delete and no transaction | 0:12:09.888404 |
| 50,000 - with journal = delete and with transaction | 0:22:43.613580 |
| 50,000 - with journal = memory and transaction | 0:09:01.140017 |
| 50,000 - with journal = memory | 0:07:38.820148 |
| 50,000 - with journal = memory and synchronous=0 | 0:07:43.587135 |
| 50,000 - with journal = memory and synchronous=1 and page_size=65535 | 0:07:19.778217 |
| 50,000 - with journal = memory and synchronous=0 and page_size=65535 | 0:07:28.186541 |
| 50,000 - with journal = delete and synchronous=1 and page_size=65535 | 0:07:06.539198 |
| 50,000 - with journal = delete and synchronous=0 and page_size=65535 | 0:07:19.810333 |
| 50,000 - with journal = wal and synchronous=0 and page_size=65535 | 0:08:22.856690 |
| 50,000 - with journal = wal and synchronous=1 and page_size=65535 | 0:08:22.326936 |
| 50,000 - with journal = delete and synchronous=1 and page_size=4096 | 0:07:35.365883 |
| 50,000 - with journal = memory and synchronous=1 and page_size=4096 | 0:07:15.183948 |
| 100,000 - with journal = delete and synchronous=1 and page_size=65535 | 0:07:13.402985 |
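
Presumably these settings are applied as PRAGMA statements right after opening the connection; a minimal sketch follows (database name and values are illustrative). Note that SQLite requires page_size to be a power of two between 512 and 65536, so a value of 65535 would be silently ignored, and it only takes effect before the database is initialized (or after a VACUUM).

    import sqlite3

    conn = sqlite3.connect('urls.db')
    # page_size must be set before the database file is initialized;
    # valid values are powers of two from 512 to 65536.
    conn.execute('PRAGMA page_size = 65536')
    conn.execute('PRAGMA journal_mode = MEMORY')  # or DELETE / WAL
    conn.execute('PRAGMA synchronous = 1')        # 0 = OFF, 1 = NORMAL, 2 = FULL
    conn.execute('PRAGMA cache_size = 8192')      # in pages (used in Edit 1 below)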

Checking around online, I came across this link https://adamyork.com/2017/07/02/fast-database-inserts-with-python-3-6-and-sqlite/, where the system is much slower than mine yet still performs very well. Two things stood out from that link:

  1. The table in that link has more columns than mine.
  2. The database file did not grow by 3.5x.

I have shared the Python code and the files here: https://github.com/ksinghgithub/python_sqlite

Can someone guide me on optimizing this code? Thanks.

Environment:

  1. Windows 10 Professional on an i5-6300U, 20 GB RAM and a 512 GB SSD.
  2. Python 3.7.0

Edit 1: New performance chart based on the feedback received about the UNIQUE constraint and on experimenting with cache_size values.

    self.db.execute('CREATE TABLE blacklist (id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT, url TEXT NOT NULL UNIQUE)')

| Configuration | Action | Time | Notes |
| --- | --- | --- | --- |
| 50,000 - with journal = delete and synchronous=1 and page_size=65535 cache_size = 8192 | REMOVE UNIQUE FROM URL | 0:00:18.011823 | Size reduced to 196MB from 350MB |
| 50,000 - with journal = delete and synchronous=1 and page_size=65535 cache_size = default | REMOVE UNIQUE FROM URL | 0:00:25.692283 | Size reduced to 196MB from 350MB |
| 100,000 - with journal = delete and synchronous=1 and page_size=65535 | | 0:07:13.402985 | |
| 100,000 - with journal = delete and synchronous=1 and page_size=65535 cache_size = 4096 | | 0:04:47.624909 | |
| 100,000 - with journal = delete and synchronous=1 and page_size=65535 cache_size = 8192 | | 0:03:32.473927 | |
| 100,000 - with journal = delete and synchronous=1 and page_size=65535 cache_size = 8192 | REMOVE UNIQUE FROM URL | 0:00:17.927050 | Size reduced to 196MB from 350MB |
| 100,000 - with journal = delete and synchronous=1 and page_size=65535 cache_size = default | REMOVE UNIQUE FROM URL | 0:00:21.804679 | Size reduced to 196MB from 350MB |
| 100,000 - with journal = delete and synchronous=1 and page_size=65535 cache_size = default | REMOVE UNIQUE FROM URL & ID | 0:00:14.062386 | Size reduced to 134MB from 350MB |
| 100,000 - with journal = delete and synchronous=1 and page_size=65535 cache_size = default | REMOVE UNIQUE FROM URL & DELETE ID | 0:00:11.961004 | Size reduced to 134MB from 350MB |

2 Answers:

Answer 0 (score: 1)

SQLite uses autocommit mode by default. That allows begin transaction to be omitted. But here we want all the inserts to happen inside a transaction, and the only way to do that is to start one with begin transaction, so that all the statements to be run happen within that transaction.

The executemany method is just the execute loop done outside of Python; it calls SQLite's prepare-statement function only once.
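
Putting the two together, a minimal sketch of executemany wrapped in one explicit transaction (isolation_level=None, i.e. true autocommit, and the url_batch variable are illustrative assumptions; the blacklist table is from the question):

    import sqlite3

    # isolation_level=None means true autocommit: no implicit BEGIN,
    # so the transaction below is controlled entirely by hand.
    conn = sqlite3.connect('urls.db', isolation_level=None)
    url_batch = ['http://example.com/1', 'http://example.com/2']  # illustrative
    conn.execute('BEGIN')
    conn.executemany('INSERT INTO blacklist (url) VALUES (?)',
                     ((u,) for u in url_batch))
    conn.execute('COMMIT')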

Here is a very bad way to remove the last N items from a list:

    templist = []
    i = 0
    while i < self.bulk_insert_entries and len(urls) > 0:
        templist.append(urls.pop())
        i += 1

It is better to do this:

    templist = urls[-self.bulk_insert_entries:]
    del urls[-self.bulk_insert_entries:]
    i = len(templist)

The slice and the del of a slice work even on an empty list.

Both approaches probably have the same complexity, but 100,000 calls to append and pop cost much more than letting Python do the operation outside the interpreter.
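
A quick timeit comparison makes the gap concrete (list size and batch size here are illustrative):

    import timeit

    setup = 'urls = list(range(1000000)); n = 100000'

    pop_loop = '''
    templist = []
    i = 0
    while i < n and len(urls) > 0:
        templist.append(urls.pop())
        i += 1
    '''

    slice_del = '''
    templist = urls[-n:]
    del urls[-n:]
    '''

    # each run consumes 100,000 items from the tail of a 1,000,000-item list
    print('pop loop :', timeit.timeit(pop_loop, setup, number=10))
    print('slice+del:', timeit.timeit(slice_del, setup, number=10))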

Answer 1 (score: 0)

The UNIQUE constraint on the URL column is creating an implicit index on the URL. That would explain the size increase.

I don't think you can populate the table first and then add the unique constraint afterwards.
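
(SQLite indeed has no ALTER TABLE ... ADD CONSTRAINT, but a comparable effect can be had by creating a separate unique index after the bulk load; a sketch, where the index name is illustrative and the statement fails if duplicate URLs already exist:)

    # Load into a table without the inline UNIQUE constraint first ...
    conn.execute('CREATE TABLE blacklist ('
                 'id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT, '
                 'url TEXT NOT NULL)')
    # ... bulk insert here ...
    # ... then enforce uniqueness; this raises if duplicates exist.
    conn.execute('CREATE UNIQUE INDEX idx_blacklist_url ON blacklist(url)')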

Your bottleneck is surely the CPU. Try the following:

  1. Install toolz: pip install toolz
  2. Use this method:

    from datetime import datetime
    import sqlite3

    from toolz import partition_all

    def add_blacklist_url(self, urls):
        # urls should be an iterable of {'url': ...} mappings so that the
        # named :url placeholder in executemany resolves correctly.
        # print('add_blacklist_url:: entries = {}'.format(len(urls)))
        start_time = datetime.now()
        for batch in partition_all(100000, urls):
            try:
                start_commit = datetime.now()
                self.cursor.executemany('''INSERT OR IGNORE INTO blacklist(url) VALUES(:url)''', batch)
                end_commit = datetime.now() - start_commit
                print('add_blacklist_url:: total time for INSERT OR IGNORE INTO blacklist {} entries = {}'.format(len(batch), end_commit))
            except sqlite3.Error as e:
                print("add_blacklist_url:: Database error: %s" % e)
            except Exception as e:
                print("add_blacklist_url:: Exception in _query: %s" % e)
        self.db.commit()
        time_elapsed = datetime.now() - start_time
        print('add_blacklist_url:: total time for {} entries = {}'.format(len(urls), time_elapsed))
    

This code is untested.