Bulk inserting a Pandas DataFrame using SQLAlchemy

Date: 2015-08-13 20:30:42

Tags: python pandas sqlalchemy

I have some rather large pandas DataFrames, and I'd like to use the new bulk SQL mappings to upload them to Microsoft SQL Server via SQLAlchemy. The pandas.to_sql method, while nice, is slow.

I'm having trouble writing the code...

I'd like to be able to pass this function a pandas DataFrame which I'm calling table, a schema name which I'm calling schema, and a table name which I'm calling name. Ideally, the function will 1.) delete the table if it already exists, 2.) create a new table, 3.) create a mapper, and 4.) bulk insert using the mapper and the pandas data. I'm stuck on part 3.

Here's my (admittedly rough) code. I'm struggling with how to get the mapper function to work with my primary keys. I don't really need primary keys, but the mapper function requires them.

Thanks for your insights.

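A minimal sketch of the four steps asked for, assuming SQLAlchemy 1.4+ and swapping the hand-written mapper for table reflection (the function name bulk_upload and the demo table/columns below are made up for illustration, and SQLite stands in for SQL Server):

```python
import pandas as pd
from sqlalchemy import create_engine, MetaData, Table

def bulk_upload(df, name, engine, schema=None):
    # 1) and 2): drop the table if it already exists and recreate it empty,
    # letting pandas derive the column types from the DataFrame
    df[:0].to_sql(name, engine, schema=schema, if_exists='replace', index=False)
    # 3): reflect the freshly created table instead of hand-writing a mapper
    table = Table(name, MetaData(schema=schema), autoload_with=engine)
    # 4): bulk insert -- a single executemany over a list of row dicts
    with engine.begin() as conn:
        conn.execute(table.insert(), df.to_dict(orient='records'))

engine = create_engine('sqlite://')  # in-memory database for the demo
df = pd.DataFrame({'Event': ['login', 'click'], 'Day': ['2015-08-01', '2015-08-02']})
bulk_upload(df, 'events', engine)
```

With a real SQL Server connection the same function should apply unchanged; only the engine URL differs.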

10 Answers:

Answer 0 (score: 25):

I was running into a similar problem, with pd.to_sql taking hours to upload data. The code below bulk-inserted the same data in a few seconds.

from sqlalchemy import create_engine
import psycopg2 as pg
# cStringIO is Python 2 only; on Python 3 use io.StringIO instead
import cStringIO

address = 'postgresql://<username>:<pswd>@<host>:<port>/<database>'
engine = create_engine(address)
connection = engine.raw_connection()
cursor = connection.cursor()

#df is the dataframe containing an index and the columns "Event" and "Day"
#create Index column to use as primary key
df.reset_index(inplace=True)
df.rename(columns={'index':'Index'}, inplace =True)

#create the table but first drop if it already exists
command = '''DROP TABLE IF EXISTS localytics_app2;
CREATE TABLE localytics_app2
(
"Index" serial primary key,
"Event" text,
"Day" timestamp without time zone
);'''
cursor.execute(command)
connection.commit()

#stream the data using 'to_csv' and StringIO(); then use sql's 'copy_from' function
output = cStringIO.StringIO()
#ignore the index
df.to_csv(output, sep='\t', header=False, index=False)
#jump to start of stream
output.seek(0)
#null values become ''
cursor.copy_from(output, 'localytics_app2', null="")
connection.commit()
cursor.close()
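Since cStringIO exists only on Python 2, the same streaming step on Python 3 uses io.StringIO (the column values below are invented for illustration):

```python
import io
import pandas as pd

df = pd.DataFrame({'Event': ['login', 'click'], 'Day': ['2015-08-01', '2015-08-02']})

# io.StringIO replaces cStringIO.StringIO on Python 3
output = io.StringIO()
df.to_csv(output, sep='\t', header=False, index=False)
output.seek(0)  # jump to start of stream

# the buffer can then be handed to psycopg2 exactly as above:
# cursor.copy_from(output, 'localytics_app2', null="")
print(output.getvalue())
```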

Answer 1 (score: 16):

This has probably been answered by now, but I found my solution by pulling together different answers on this site and aligning with SQLAlchemy's docs.

  1. The table must already exist in db1, with the index set to auto_increment on.
  2. The class Current needs to align with the DataFrame imported from the CSV and with the table in db1.
  3. Hope this helps anyone who comes here needing to mix Pandas and SQLAlchemy quickly.

    from urllib import quote_plus as urlquote  # Python 3: from urllib.parse import quote_plus
    import sqlalchemy
    from sqlalchemy import create_engine
    from sqlalchemy.ext.declarative import declarative_base
    from sqlalchemy import Column, Integer, String, Numeric
    from sqlalchemy.orm import sessionmaker
    import pandas as pd
    
    
    # Set up of the engine to connect to the database
    # the urlquote is used for passing the password which might contain special characters such as "/"
    engine = create_engine('mysql://root:%s@localhost/db1' % urlquote('weirdPassword*withsp€cialcharacters'), echo=False)
    conn = engine.connect()
    Base = declarative_base()
    
    #Declaration of the class in order to write into the database. This structure is standard and should align with SQLAlchemy's doc.
    class Current(Base):
        __tablename__ = 'tableName'
    
        id = Column(Integer, primary_key=True)
        Date = Column(String(500))
        Type = Column(String(500))
        Value = Column(Numeric())
    
        def __repr__(self):
            return "(id='%s', Date='%s', Type='%s', Value='%s')" % (self.id, self.Date, self.Type, self.Value)
    
    # Set up of the table in db and the file to import
    fileToRead = 'file.csv'
    tableToWriteTo = 'tableName'
    
    # Panda to create a lovely dataframe
    df_to_be_written = pd.read_csv(fileToRead)
    # orient='records' is the key here: it produces the list-of-dicts format the docs describe for bulk inserts
    listToWrite = df_to_be_written.to_dict(orient='records')
    
    metadata = sqlalchemy.schema.MetaData(bind=engine,reflect=True)
    table = sqlalchemy.Table(tableToWriteTo, metadata, autoload=True)
    
    # Open the session
    Session = sessionmaker(bind=engine)
    session = Session()
    
    # Insert the dataframe into the database in one bulk
    conn.execute(table.insert(), listToWrite)
    
    # Commit the changes
    session.commit()
    
    # Close the session
    session.close()
    

Answer 2 (score: 10):

Building on @ansonw's answer:

import io  # the original used Python 2's cStringIO

def to_sql(engine, df, table, if_exists='fail', sep='\t', encoding='utf8'):
    # Create Table
    df[:0].to_sql(table, engine, if_exists=if_exists)

    # Prepare data
    output = io.StringIO()
    df.to_csv(output, sep=sep, header=False, encoding=encoding)
    output.seek(0)

    # Insert data
    connection = engine.raw_connection()
    cursor = connection.cursor()
    cursor.copy_from(output, table, sep=sep, null='')
    connection.commit()
    cursor.close()

I insert 200,000 rows in 5 seconds instead of 4 minutes.

Answer 3 (score: 4):

Pandas 0.25.1 has a parameter to do multi-row inserts, so it's no longer necessary to work around this issue with SQLAlchemy.

Set method='multi' when calling pandas.DataFrame.to_sql.

In this example, it would be df.to_sql(table, schema=schema, con=e, index=False, if_exists='replace', method='multi')

Answer sourced from the documentation here.

Worth noting that I've only tested this with Redshift. Please let me know how it goes on other databases so I can update this answer.

Answer 4 (score: 3):

My Postgres-specific solution below auto-creates the database table from your pandas DataFrame and performs a fast bulk insert using Postgres's COPY my_table FROM ...

import io

import pandas as pd
from sqlalchemy import create_engine

def write_to_table(df, db_engine, schema, table_name, if_exists='fail'):
    string_data_io = io.StringIO()
    df.to_csv(string_data_io, sep='|', index=False)
    pd_sql_engine = pd.io.sql.pandasSQL_builder(db_engine, schema=schema)
    table = pd.io.sql.SQLTable(table_name, pd_sql_engine, frame=df,
                               index=False, if_exists=if_exists, schema=schema)
    table.create()
    string_data_io.seek(0)  # no need to strip the header here: the HEADER option in COPY skips it
    with db_engine.connect() as connection:
        with connection.connection.cursor() as cursor:
            copy_cmd = "COPY %s.%s FROM STDIN HEADER DELIMITER '|' CSV" % (schema, table_name)
            cursor.copy_expert(copy_cmd, string_data_io)
        connection.connection.commit()

Answer 5 (score: 1):

Since this is an I/O-heavy workload, you can also use the Python threading module via multiprocessing.dummy. This sped things up for me:

import math
from multiprocessing.dummy import Pool as ThreadPool

...

def insert_df(df, *args, **kwargs):
    nworkers = 4

    chunksize = math.floor(df.shape[0] / nworkers)
    chunks = [(chunksize * i, (chunksize * i) + chunksize) for i in range(nworkers)]
    chunks.append((chunksize * nworkers, df.shape[0]))
    pool = ThreadPool(nworkers)

    def worker(chunk):
        i, j = chunk
        df.iloc[i:j, :].to_sql(*args, **kwargs)

    pool.map(worker, chunks)
    pool.close()
    pool.join()


....

insert_df(df, "foo_bar", engine, if_exists='append')

Answer 6 (score: 1):

Here is a simple approach.

Download the driver for the SQL database connection:

For Linux and macOS:

https://docs.microsoft.com/en-us/sql/connect/odbc/linux-mac/installing-the-microsoft-odbc-driver-for-sql-server?view=sql-server-2017

For Windows:

https://www.microsoft.com/en-us/download/details.aspx?id=56567

Create the connection:

from sqlalchemy import create_engine 
import urllib
server = '*****'
database = '********'
username = '**********'
password = '*********'

params = urllib.parse.quote_plus(
'DRIVER={ODBC Driver 17 for SQL Server};'+ 
'SERVER='+server+';DATABASE='+database+';UID='+username+';PWD='+ password) 

engine = create_engine("mssql+pyodbc:///?odbc_connect=%s" % params) 

#Checking Connection (note: _is_sqlalchemy_connectable is a private pandas API)
connected = pd.io.sql._is_sqlalchemy_connectable(engine)

print(connected)   #Output is True if connection established successfully

Insert the data:

df.to_sql('Table_Name', con=engine, if_exists='append', index=False)


"""
if_exists: {'fail', 'replace', 'append'}, default 'fail'
     fail: If table exists, do nothing.
     replace: If table exists, drop it, recreate it, and insert data.
     append: If table exists, insert data. Create if does not exist.
"""

If there are many records:

# limit based on sp_prepexec parameter count
tsql_chunksize = 2097 // len(df.columns)
# cap at 1000 (limit for number of rows inserted by table-value constructor)
tsql_chunksize = 1000 if tsql_chunksize > 1000 else tsql_chunksize
print(tsql_chunksize)


df.to_sql('table_name', con = engine, if_exists = 'append', index= False, chunksize=tsql_chunksize)

PS: You can change the parameters as needed.

Answer 7 (score: 0):

This works for me connecting to an Oracle database using cx_Oracle and SQLAlchemy:

import sqlalchemy
import cx_Oracle
from sqlalchemy import create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, String
from sqlalchemy.orm import sessionmaker
import pandas as pd
import time

# credentials
username = "username"
password = "password"
connectStr = "connection:/string"
tableName = "tablename"

t0 = time.time()

# connection
dsn = cx_Oracle.makedsn('host','port',service_name='servicename')

Base = declarative_base()

class LANDMANMINERAL(Base):
    __tablename__ = 'tablename'

    DOCUMENTNUM = Column(String(500), primary_key=True)
    DOCUMENTTYPE = Column(String(500))
    FILENUM = Column(String(500))
    LEASEPAYOR = Column(String(500))
    LEASESTATUS = Column(String(500))
    PROSPECT = Column(String(500))
    SPLIT = Column(String(500))
    SPLITSTATUS = Column(String(500))

engine = create_engine('oracle+cx_oracle://%s:%s@%s' % (username, password, dsn))
conn = engine.connect()  

Base.metadata.bind = engine

# Creating the session

DBSession = sessionmaker(bind=engine)

session = DBSession()

# Bulk insertion
data = pd.read_csv('data.csv')
lists = data.to_dict(orient='records')


table = sqlalchemy.Table('landmanmineral', Base.metadata, autoload=True)
conn.execute(table.insert(), lists)

session.commit()

session.close() 

print("time taken %8.8f seconds" % (time.time() - t0) )

Answer 8 (score: 0):

For people like me who are trying to implement the aforementioned solutions:

Pandas 0.24.0 now has to_sql with chunksize and method='multi' options to insert in bulk...
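A small self-contained illustration of those two options together (SQLite in-memory here, and the table name demo is made up; with SQL Server you would pass your mssql+pyodbc engine instead):

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite://')  # in-memory database for the demo
df = pd.DataFrame({'a': range(10), 'b': range(10, 20)})

# method='multi' packs many rows into each INSERT statement;
# chunksize caps how many rows go into a single statement
df.to_sql('demo', engine, index=False, if_exists='replace',
          chunksize=5, method='multi')
```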

Answer 9 (score: -1):

For anyone encountering this problem with Redshift as the target database: note that Redshift does not implement the full set of Postgres commands, so some of the answers that use Postgres's COPY FROM or copy_from() will not work. psycopg2.ProgrammingError: syntax error at or near "stdin" error when trying to copy_from redshift

The solution for speeding up INSERTs into Redshift is to use a file ingest or Odo.

References:
Odo performance
http://odo.pydata.org/en/latest/perf.html
Odo with Redshift
https://github.com/blaze/odo/blob/master/docs/source/aws.rst
Redshift COPY (from an S3 file)
https://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html
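As a sketch of the file-ingest route, the COPY statement Redshift expects can be assembled like this (the helper name, bucket path, and IAM role below are hypothetical placeholders; uploading the extract to S3 and running the statement are left out):

```python
def redshift_copy_sql(table, s3_path, iam_role):
    # Redshift loads from an S3 file via COPY, not from STDIN
    return (
        "COPY {} FROM '{}' IAM_ROLE '{}' FORMAT AS CSV GZIP;"
        .format(table, s3_path, iam_role)
    )

sql = redshift_copy_sql('events', 's3://my-bucket/events.csv.gz',
                        'arn:aws:iam::123456789012:role/redshift-load')
print(sql)
```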