Question

Is ≈105 seconds per 1 million rows slow or fast for inserts into a local PostgreSQL database, on a table with 2 indexes and 4 columns?

Python code:

import pandas as pd
from concurrent.futures import ThreadPoolExecutor, as_completed
from sqlalchemy import create_engine

# host, data_path and tbl_name are defined elsewhere (not shown in the question)

num = 32473068    # total number of rows in the CSV
batch = 1000000   # rows read from the CSV per iteration

def main(data):
    # Note: a new engine is created for every 10,000-row chunk
    engine = create_engine('postgresql://***:****' + host + ':5432/kaggle')
    data.to_sql(con=engine, name=tbl_name, if_exists='append', index=False)

for i in range(0, num, batch):
    data = pd.read_csv(data_path + 'app_events.csv', skiprows=i, nrows=batch)
    data.columns = ['event_id', 'app_id', 'is_installed', 'is_active']
    data = data.reset_index(drop=True)
    # Split the 1M-row frame into 10,000-row chunks, one per worker task
    batchSize = 10000
    batchList = [data.iloc[x:x + batchSize].reset_index(drop=True) for x in range(0, len(data), batchSize)]
    with ThreadPoolExecutor(max_workers=30) as executor:
        futures = [executor.submit(main, d) for d in batchList]
        for future in as_completed(futures):
            future.result()  # re-raise any exception raised in a worker

Answer 1

It also depends on your hardware. For reference, my old i5 laptop took ~300 s to insert 0.1M rows (roughly 200-300 MB).
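To put the two figures on the same scale, a quick back-of-the-envelope calculation helps. The sketch below uses only the numbers quoted in the question and in this answer; the helper name `rows_per_second` is made up for illustration.

```python
def rows_per_second(rows, seconds):
    """Return insert throughput in rows per second."""
    return rows / seconds

# Figures from the question: 1,000,000 rows in about 105 s, 32,473,068 rows in total.
question_rate = rows_per_second(1_000_000, 105)
total_minutes = 32_473_068 / question_rate / 60

# Figures from this answer: 0.1M rows in about 300 s on an old i5 laptop.
laptop_rate = rows_per_second(100_000, 300)

print(f"question: {question_rate:,.0f} rows/s, full load in about {total_minutes:.0f} min")
print(f"laptop:   {laptop_rate:,.0f} rows/s")
```

So the question's setup runs at roughly 9,500 rows per second, which works out to a little under an hour for the whole file, while the laptop figure is a few hundred rows per second.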

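I learned from other similar questions that splitting a large set of values into chunks when using the insert() command can speed things up. Since you are using Pandas, I assumed it already had some optimization along these lines. Still, I suggest running a quick test to see whether it helps.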
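One way to run such a quick test is sketched below. It is not the asker's code: it uses the standard-library `sqlite3` module with an in-memory database as a stand-in for PostgreSQL so that it runs without a server, and the table name `app_events`, the helper `insert_in_chunks`, and the chunk size are assumptions chosen for illustration. The idea is simply to send one multi-row INSERT per chunk instead of one statement per row.

```python
import sqlite3

# Column names taken from the question.
COLUMNS = ("event_id", "app_id", "is_installed", "is_active")

def insert_in_chunks(conn, table, rows, chunk_size=200):
    """Insert rows using one multi-row INSERT statement per chunk.

    `table` is formatted into the SQL text, so it must be a trusted name.
    """
    row_placeholder = "(" + ", ".join("?" for _ in COLUMNS) + ")"
    cur = conn.cursor()
    for start in range(0, len(rows), chunk_size):
        chunk = rows[start:start + chunk_size]
        sql = "INSERT INTO {} ({}) VALUES {}".format(
            table, ", ".join(COLUMNS), ", ".join([row_placeholder] * len(chunk)))
        # Flatten the chunk so the parameters line up with the placeholders.
        params = [value for row in chunk for value in row]
        cur.execute(sql, params)
    conn.commit()

# In-memory SQLite database used only as a stand-in for the real table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE app_events "
    "(event_id INTEGER, app_id INTEGER, is_installed INTEGER, is_active INTEGER)")
rows = [(i, i * 10, 1, i % 2) for i in range(1000)]
insert_in_chunks(conn, "app_events", rows)
inserted = conn.execute("SELECT COUNT(*) FROM app_events").fetchone()[0]
print(inserted)
```

Against PostgreSQL through psycopg2 the placeholder would be `%s` rather than `?`, and much larger chunks are practical; timing this against a row-by-row loop on your own machine is the real test.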

Update: Pandas actually uses a non-optimized insert command. See (to_sql + sqlalchemy + copy from + postgresql engine?). So a bulk insert or some other method should be used to improve performance.
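The linked question is about PostgreSQL's COPY command, which is one of those other methods. A minimal sketch follows, assuming `conn` is a psycopg2 connection; the helper name `copy_rows` and the table name passed to it are hypothetical, and the rows are serialized with the standard `csv` module rather than by Pandas.

```python
import csv
import io

def copy_rows(conn, table, columns, rows):
    """Stream rows into `table` with COPY ... FROM STDIN.

    `conn` is assumed to be a psycopg2 connection. `table` and `columns`
    are formatted into the SQL text, so they must be trusted identifiers.
    """
    buffer = io.StringIO()
    # Serialize the rows as CSV in memory.
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    buffer.seek(0)
    sql = "COPY {} ({}) FROM STDIN WITH (FORMAT csv)".format(
        table, ", ".join(columns))
    with conn.cursor() as cur:
        cur.copy_expert(sql, buffer)  # psycopg2 cursor method
    conn.commit()
```

With a DataFrame, the buffer could instead be filled by `data.to_csv(buffer, index=False, header=False)` before the same `copy_expert` call.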
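SQLAlchemy 1.2 uses batched inserts when you initialize the engine with the "use_batch_mode=True" parameter. I saw a 100x speedup on my i5 + HDD laptop! That is, the 0.1M records that originally took me 300 s now take 3 s! If your computer is better than mine, I bet you will see a big speedup with your 1M records.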
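For reference, the engine setup might look like the sketch below. It assumes SQLAlchemy 1.2.x with the psycopg2 driver; later SQLAlchemy releases replaced this flag with `executemany_mode`, so check the documentation for the version you have installed. The helper name `make_batch_engine` and the credentials in the commented usage are placeholders, not values from the question.

```python
def make_batch_engine(url):
    """Create an engine whose executemany() calls are batched by psycopg2.

    Assumes SQLAlchemy 1.2.x with the psycopg2 driver installed; `url` is a
    placeholder for your own connection string.
    """
    # Imported inside the function so the sketch loads without SQLAlchemy.
    from sqlalchemy import create_engine
    return create_engine(url, use_batch_mode=True)

# Hypothetical usage: create the engine once and reuse it for every chunk.
# engine = make_batch_engine("postgresql://user:password@localhost:5432/kaggle")
# data.to_sql(con=engine, name="app_events", if_exists="append", index=False)
```

Creating the engine once, rather than inside the per-chunk function as in the question, also avoids setting up a new connection pool for every 10,000 rows.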
