I have been looking for ways to speed up pushing a dataframe to SQL Server and stumbled upon an approach here. The speed of that approach blew me away: using plain to_sql took almost 2 hours, while this script finished in 12.54 seconds to push a 100k row x 100 column df.
So after testing the code below with the sample df, I tried using a df with many different data types (int, string, float, booleans). However, I was sad to see a MemoryError. I started reducing the size of my df to see what the limits were, and noticed that if my df contained any strings I was unable to load it into SQL Server. I have not been able to isolate the problem any further. The script below is taken from the question in the link, but I added a tiny df with strings. Any suggestions on how to rectify this would be great!
import pandas as pd
import numpy as np
import time
from sqlalchemy import create_engine, event
from urllib.parse import quote_plus
import pyodbc

conn = "DRIVER={SQL Server};SERVER=SERVER_IP;DATABASE=DB_NAME;UID=USER_ID;PWD=PWD"
quoted = quote_plus(conn)
new_con = 'mssql+pyodbc:///?odbc_connect={}'.format(quoted)
engine = create_engine(new_con)

# turn on pyodbc's fast_executemany for bulk inserts issued through SQLAlchemy
@event.listens_for(engine, 'before_cursor_execute')
def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
    print("FUNC call")
    if executemany:
        cursor.fast_executemany = True

table_name = 'fast_executemany_test'

# small test df containing only strings
df1 = pd.DataFrame({'col1': ['tyrefdg', 'ertyreg', 'efdgfdg'],
                    'col2': ['tydfggfdgrefdg', 'erdfgfdgfdgfdgtyreg', 'edfgfdgdfgdffdgfdg']})

s = time.time()
df1.to_sql(table_name, engine, if_exists='replace', chunksize=None)
print(time.time() - s)
Answer 0 (score: 8)
I was able to reproduce your issue using pyodbc 4.0.23. The MemoryError was related to your use of the ancient
DRIVER={SQL Server}
driver. Further testing using
DRIVER=ODBC Driver 11 for SQL Server
also failed, with
Function sequence error (0) (SQLParamData)
which was related to an existing pyodbc issue on GitHub. I posted my findings here.
That issue is still under investigation. In the meantime you might be able to proceed by switching to
DRIVER=ODBC Driver 13 for SQL Server
and running pip install pyodbc==4.0.22 to use an earlier version of pyodbc.
Answer 1 (score: 0)
I ran into this problem on 32-bit Python, and switching my interpreter to 64-bit resolved my memory issue. Beyond that solution, I would recommend chunking the amount of data you process: establish a threshold, and once the threshold is reached, process that chunk of data and iterate until all of the data has been processed.
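A rough sketch of that chunking idea, assuming df, engine, and table_name are the objects from the question (the 10,000-row chunk size is an arbitrary threshold):

# sketch only: chunked load, 10,000 rows per chunk is an arbitrary threshold;
# df, engine and table_name are assumed to be the objects from the question
chunk_size = 10000
for start in range(0, len(df), chunk_size):
    chunk = df.iloc[start:start + chunk_size]
    # replace the table on the first chunk, append on the rest
    chunk.to_sql(table_name, engine,
                 if_exists='replace' if start == 0 else 'append',
                 index=False)

Note that to_sql also accepts a chunksize argument that splits the insert into batches for you; the manual loop above just makes the threshold explicit.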