Question

I have a 1,000,000 x 50 pandas DataFrame that I am writing to a SQL table using:

df.to_sql('my_table', con, index=False)

This takes an incredibly long time. I have seen various explanations online for how to speed this up, but none of them seem to work for MSSQL.

If I try the method from:

Bulk Insert A Pandas DataFrame Using SQLAlchemy

then I get a no attribute copy_from error.
If I try the multithreading method from:

http://techyoubaji.blogspot.com/2015/10/speed-up-pandas-tosql-with.html

then I get a QueuePool limit of size 5 overflow 10 reached, connection timed out error.

Is there an easy way to speed up to_sql() to an MSSQL table? Perhaps via BULK COPY or some other method, but entirely from within Python code?

Answer 1

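I have used ctds to do a bulk insert, which is much faster with SQL Server. In the example below, df is the pandas DataFrame. The column order in the DataFrame is identical to the schema in mydb.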

import ctds

conn = ctds.connect('server', user='user', password='password', database='mydb')
conn.bulk_insert('table', (df.to_records(index=False).tolist()))
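The call above converts all one million rows into a single Python list before inserting. If that is too memory-hungry, one option is to insert slice by slice. The following is only a sketch: batch_bounds is a hypothetical helper (not part of ctds or pandas), the batch size of 50,000 is an arbitrary assumption, and it assumes conn.bulk_insert can simply be called repeatedly against the same table.

```python
def batch_bounds(n_rows, size):
    """Yield (start, stop) index pairs covering range(n_rows) in steps of `size`."""
    for start in range(0, n_rows, size):
        yield start, min(start + size, n_rows)


# Hypothetical usage with the conn and df from the answer above:
# for start, stop in batch_bounds(len(df), 50_000):
#     rows = df.iloc[start:stop].to_records(index=False).tolist()
#     conn.bulk_insert('table', rows)
```

Each slice uses the same to_records(...).tolist() conversion as the answer, so only one batch of rows is held as Python objects at a time.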

Answer 2

I had the same problem, so I used sqlalchemy with fast_executemany.

from sqlalchemy import event, create_engine
engine = create_engine('connection_string_with_database')
# Register a hook that runs just before every cursor execution on this engine.
@event.listens_for(engine, 'before_cursor_execute')
def plugin_bef_cursor_execute(conn, cursor, statement, params, context, executemany):
    if executemany:
        cursor.fast_executemany = True  # switch pyodbc from executemany to fast_executemany
        cursor.commit()

Always make sure that the function above is defined after the engine variable is created and before the cursor executes.

conn = engine.connect()  # engine.execute() requires a statement; connect() returns a connection
df.to_sql('table', con=conn, if_exists='append', index=False)  # for reference, see the pandas to_sql documentation
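As an alternative to the event listener, SQLAlchemy 1.3 and later accept fast_executemany=True directly in create_engine for the mssql+pyodbc dialect. The sketch below assumes that dialect and a raw ODBC connection string; build_mssql_url and write_frame are hypothetical helper names introduced here for illustration, not part of the answer.

```python
from urllib.parse import quote_plus


def build_mssql_url(odbc_connect):
    """Wrap a raw ODBC connection string in a SQLAlchemy mssql+pyodbc URL."""
    return "mssql+pyodbc:///?odbc_connect=" + quote_plus(odbc_connect)


def write_frame(df, table, odbc_connect):
    """Append df to `table` with pyodbc's fast_executemany enabled."""
    # Imported lazily so the URL helper above needs only the standard library.
    from sqlalchemy import create_engine

    # Assumption: SQLAlchemy >= 1.3 with pyodbc installed.
    engine = create_engine(build_mssql_url(odbc_connect), fast_executemany=True)
    df.to_sql(table, con=engine, if_exists="append", index=False)
```

With this form no listener needs to be registered, because the dialect sets the flag on the cursor itself.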

Answer 3

As of pandas 0.24 you can use method='multi' with a chunk size of 1000, which is the SQL Server limit on rows per INSERT statement:

chunksize=1000, method='multi'

https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-sql-method

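New in version 0.24.0.

The parameter method controls the SQL insertion clause used. Possible values are:

None: uses the standard SQL INSERT clause (one per row).
'multi': passes multiple values in a single INSERT clause. It uses a special SQL syntax not supported by all backends. This usually provides better performance for analytic databases like Presto and Redshift, but has worse performance for traditional SQL backends if the table contains many columns. For more information, check the SQLAlchemy documentation.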
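One caveat when combining method='multi' with a wide table: besides the 1000-row limit, SQL Server also rejects statements with more than 2100 parameters, and a multi-row INSERT uses one parameter per cell. The sketch below picks a chunk size that respects both limits. safe_chunksize is a hypothetical helper, and staying one parameter below the limit is a conservative assumption rather than a documented requirement.

```python
def safe_chunksize(n_columns, max_params=2100, max_rows=1000):
    """Largest row count per INSERT that stays within both SQL Server limits."""
    if n_columns < 1:
        raise ValueError("n_columns must be at least 1")
    # Stay one parameter below the limit to be conservative (assumption).
    by_params = (max_params - 1) // n_columns
    return max(1, min(max_rows, by_params))


# Hypothetical usage with the 50-column DataFrame from the question:
# df.to_sql('my_table', con, index=False, method='multi',
#           chunksize=safe_chunksize(len(df.columns)))
```

For the 50-column DataFrame in the question this gives 41 rows per statement, which is far below 1000, so the gain from method='multi' may be smaller than expected.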

Speeding up pandas to_sql()?