Loading a CSV file into a database using sqlalchemy

Time: 2015-07-13 23:09:43

Tags: python database sqlalchemy

I want to load a CSV file into a database

5 Answers:

Answer 0: (Score: 36)

Because of the power of SQLAlchemy, I'm also using it on a project. Its power comes from the object-oriented way of "talking" to a database instead of hardcoding SQL statements that can be a pain to manage. Not to mention, it's also a lot faster.

To answer your question directly: yes! Storing data from a CSV file into a database using SQLAlchemy is a piece of cake. Here is a full working example (I used SQLAlchemy 1.0.6 and Python 2.7.6).

(Note: this is not necessarily the "best" way to do it, but I think this format is very readable for a beginner; it's also pretty fast: 0.091 s for 251 records inserted!)

I think if you go through it line by line you'll see what a breeze it is to use. Notice the absence of SQL statements -- hooray! I also took the liberty of using numpy to load the CSV contents in two lines, but it can be done without it if you like. Here is the SQLAlchemy version; a full "traditional" example follows further down for comparison:

from numpy import genfromtxt
from time import time
from datetime import datetime
from sqlalchemy import Column, Integer, Float, Date
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

def Load_Data(file_name):
    data = genfromtxt(file_name, delimiter=',', skip_header=1, converters={0: lambda s: str(s)})
    return data.tolist()

Base = declarative_base()

class Price_History(Base):
    #Tell SQLAlchemy what the table name is and if there's any table-specific arguments it should know about
    __tablename__ = 'Price_History'
    __table_args__ = {'sqlite_autoincrement': True}
    #tell SQLAlchemy the name of column and its attributes:
    id = Column(Integer, primary_key=True, nullable=False) 
    date = Column(Date)
    opn = Column(Float)
    hi = Column(Float)
    lo = Column(Float)
    close = Column(Float)
    vol = Column(Float)

if __name__ == "__main__":
    t = time()

    #Create the database
    engine = create_engine('sqlite:///csv_test.db')
    Base.metadata.create_all(engine)

    #Create the session
    session = sessionmaker()
    session.configure(bind=engine)
    s = session()

    try:
        file_name = "t.csv" #sample CSV file used:  http://www.google.com/finance/historical?q=NYSE%3AT&ei=W4ikVam8LYWjmAGjhoHACw&output=csv
        data = Load_Data(file_name) 

        for i in data:
            record = Price_History(**{
                'date' : datetime.strptime(i[0], '%d-%b-%y').date(),
                'opn' : i[1],
                'hi' : i[2],
                'lo' : i[3],
                'close' : i[4],
                'vol' : i[5]
            })
            s.add(record) #Add all the records

        s.commit() #Attempt to commit all the records
    except:
        s.rollback() #Rollback the changes on error
    finally:
        s.close() #Close the connection
    print "Time elapsed: " + str(time() - t) + " s." #0.091s

If you wanted to compare it against the "traditional" way of doing things, here is the full example for reference:

import sqlite3
import time
from numpy import genfromtxt

def dict_factory(cursor, row):
    d = {}
    for idx, col in enumerate(cursor.description):
        d[col[0]] = row[idx]
    return d

def Create_DB(db):
    #Create DB and format it as needed
    with sqlite3.connect(db) as conn:
        conn.row_factory = dict_factory
        conn.text_factory = str
        cursor = conn.cursor()
        cursor.execute("CREATE TABLE [Price_History] ([id] INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL UNIQUE, [date] DATE, [opn] FLOAT, [hi] FLOAT, [lo] FLOAT, [close] FLOAT, [vol] INTEGER);")

def Add_Record(db, data):
    #Insert record into table
    with sqlite3.connect(db) as conn:
        conn.row_factory = dict_factory
        conn.text_factory = str
        cursor = conn.cursor()
        cursor.execute("INSERT INTO Price_History({cols}) VALUES({vals});".format(cols = str(data.keys()).strip('[]'), vals=str([data[i] for i in data]).strip('[]')))

def Load_Data(file_name):
    data = genfromtxt(file_name, delimiter=',', skiprows=1, converters={0: lambda s: str(s)})
    return data.tolist()

if __name__ == "__main__":
    t = time.time()

    db = 'csv_test_sql.db' #Database filename
    file_name = "t.csv" #sample CSV file used:  http://www.google.com/finance/historical?q=NYSE%3AT&ei=W4ikVam8LYWjmAGjhoHACw&output=csv

    data = Load_Data(file_name) #Get data from CSV
    Create_DB(db) #Create DB

    #For every record, format and insert to table
    for i in data:
        record = {
            'date' : i[0],
            'opn' : i[1],
            'hi' : i[2],
            'lo' : i[3],
            'close' : i[4],
            'vol' : i[5]
        }
        Add_Record(db, record)

    print "Time elapsed: " + str(time.time() - t) + " s." #3.604s

(Note: even in the "old" way, this is by no means the best way to do it, but it is very readable and a one-to-one translation of the SQLAlchemy way.)

Notice the SQL statements: one to create the table, another to insert the records. Also notice that maintaining long SQL strings is a bit more cumbersome than a simple class-attribute addition. Liking SQLAlchemy so far?

As for your foreign key inquiry, of course! SQLAlchemy has the power to do this too. Here is an example of a class attribute with a foreign key assignment (assuming the ForeignKey class has also been imported from the sqlalchemy module):

fid = Column(Integer, ForeignKey('Price_History.id'))

which points the "fid" column to Price_History's id column.
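Fleshing that out a little, a related table built on such a foreign key might look like the following. This is only a minimal sketch: the Asset_Analysis class name and its columns are hypothetical, added for illustration, and it uses the declarative_base import location from sqlalchemy.orm (SQLAlchemy 1.4+).

```python
from sqlalchemy import Column, Float, ForeignKey, Integer, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Price_History(Base):
    __tablename__ = 'Price_History'
    __table_args__ = {'sqlite_autoincrement': True}
    id = Column(Integer, primary_key=True, nullable=False)
    close = Column(Float)

class Asset_Analysis(Base):
    # Hypothetical child table: each row references one Price_History row
    __tablename__ = 'Asset_Analysis'
    id = Column(Integer, primary_key=True, nullable=False)
    fid = Column(Integer, ForeignKey('Price_History.id'))  # -> Price_History.id
    signal = Column(Float)

engine = create_engine('sqlite://')  # in-memory SQLite for the demo
Base.metadata.create_all(engine)

Session = sessionmaker(bind=engine)
s = Session()
s.add(Price_History(id=1, close=34.5))
s.add(Asset_Analysis(fid=1, signal=0.8))
s.commit()
```

Joining the two tables then works without any hand-written SQL, e.g. s.query(Price_History).join(Asset_Analysis, Asset_Analysis.fid == Price_History.id).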

Hope that helps!

Answer 1: (Score: 31)

If your CSV is quite large, using INSERTs is very ineffective. You should use bulk loading mechanisms, which differ from base to base. E.g. in PostgreSQL you should use the "COPY FROM" method:

with open(csv_file_path, 'r') as f:    
    conn = create_engine('postgresql+psycopg2://...').raw_connection()
    cursor = conn.cursor()
    cmd = 'COPY tbl_name(col1, col2, col3) FROM STDIN WITH (FORMAT CSV, HEADER FALSE)'
    cursor.copy_expert(cmd, f)
    conn.commit()
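The same principle applies to back ends without a COPY-style command. SQLite, for instance, has no bulk-load SQL, but batching all rows into a single DB-API executemany() call is already far cheaper than issuing one INSERT per row. A minimal stdlib-only sketch, with the table and column names made up for illustration:

```python
import csv
import sqlite3

def bulk_load_csv(db_path, csv_path):
    # For very large files, read and insert in slices instead of all at once
    with open(csv_path, newline='') as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        rows = list(reader)

    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS tbl_name (col1 TEXT, col2 REAL, col3 REAL)")
        # One executemany() call instead of one execute() per row
        conn.executemany(
            "INSERT INTO tbl_name (col1, col2, col3) VALUES (?, ?, ?)", rows)
        conn.commit()
    finally:
        conn.close()
```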

Answer 2: (Score: 2)

CSV file with commas and header names, into PostgreSQL

  1. I use the csv Python reader. The CSV data is separated by commas (,).
  2. It is then converted into a Pandas DataFrame. The column names are the same as in your CSV file.
  3. Finally, the DataFrame is sent to SQL, with the engine as the connection to the DB, using if_exists='replace' or 'append'.
import csv
import pandas as pd
from sqlalchemy import create_engine

# Create engine to connect with DB
try:
    engine = create_engine(
        'postgresql://username:password@localhost:5432/name_of_base')
except Exception:
    print("Can't create engine")

# Get data from CSV file to DataFrame(Pandas)
with open('test.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    columns = [
        'moment',
        'isin',
        'name'
    ]
    df = pd.DataFrame(data=reader, columns=columns)

# Standard Pandas method to deliver data from the DataFrame to PostgreSQL
try:
    with engine.begin() as connection:
        df.to_sql('name_of_table', con=connection, index_label='id', if_exists='replace')
        print('Done, ok!')
except Exception:
    print('Something went wrong!')

Answer 3: (Score: 1)

I was having the exact same problem, and I found it easier to use a two-step process with Pandas instead:

import pandas as pd
with open(csv_file_path, 'r') as file:
    data_df = pd.read_csv(file)
data_df.to_sql('tbl_name', con=engine, index=True, index_label='id', if_exists='replace')

Note that my approach is similar to this one, but somehow Google sent me to this thread instead, so I thought I would share.
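If the CSV is too big to hold in memory at once, the same two-step idea still works with pandas' chunksize option, which turns read_csv into an iterator of DataFrames. A sketch, with the function name and chunk size chosen arbitrarily for illustration:

```python
import pandas as pd

def load_csv_in_chunks(csv_path, table_name, engine, chunksize=50000):
    # Stream the file so only one chunk is in memory at a time
    first = True
    for chunk in pd.read_csv(csv_path, chunksize=chunksize):
        # Replace on the first chunk for a clean rerun, then append the rest
        chunk.to_sql(table_name, con=engine,
                     if_exists='replace' if first else 'append', index=False)
        first = False
```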

Answer 4: (Score: 1)

To import a relatively small CSV file into a database using sqlalchemy, you can use engine.execute(my_table.insert(), list_of_row_dicts), as described in the "Executing Multiple Statements" section of the sqlalchemy tutorial.

This is sometimes referred to as "executemany"-style invocation, because it results in an executemany DBAPI call. The DB driver may execute a single multi-valued INSERT .. VALUES (..), (..), (..) statement, which results in fewer round trips to the database and faster execution.

According to sqlalchemy's FAQ, this is the fastest you can get without using DB-specific bulk loading methods, such as COPY FROM in Postgres, LOAD DATA LOCAL INFILE in MySQL, etc. In particular, it's faster than using plain ORM (as in the answer by @Manuel J. Diaz here), bulk_save_objects, or bulk_insert_mappings.

import csv
from sqlalchemy import create_engine, Table, Column, Integer, MetaData

engine = create_engine('sqlite:///sqlalchemy.db', echo=True)

metadata = MetaData()
# Define the table with sqlalchemy:
my_table = Table('MyTable', metadata,
    Column('foo', Integer),
    Column('bar', Integer),
)
metadata.create_all(engine)
insert_query = my_table.insert()

# Or read the definition from the DB:
# metadata.reflect(engine, only=['MyTable'])
# my_table = Table('MyTable', metadata, autoload=True, autoload_with=engine)
# insert_query = my_table.insert()

# Or hardcode the SQL query:
# insert_query = "INSERT INTO MyTable (foo, bar) VALUES (:foo, :bar)"

with open('test.csv', 'r', encoding="utf-8") as csvfile:
    csv_reader = csv.reader(csvfile, delimiter=',')
    engine.execute(
        insert_query,
        [{"foo": row[0], "bar": row[1]} 
            for row in csv_reader]
    )
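One caveat: engine.execute() was removed in SQLAlchemy 2.0. On 1.4+ the same executemany-style insert is written against an explicit connection; engine.begin() commits automatically on success. A sketch with the same table definition, and the CSV rows replaced by an inline list for brevity:

```python
from sqlalchemy import Column, Integer, MetaData, Table, create_engine

engine = create_engine('sqlite://')  # in-memory SQLite for the demo
metadata = MetaData()
my_table = Table('MyTable', metadata,
    Column('foo', Integer),
    Column('bar', Integer),
)
metadata.create_all(engine)

rows = [{"foo": 1, "bar": 2}, {"foo": 3, "bar": 4}]  # stand-in for csv_reader rows

# Passing a list of dicts still triggers the executemany fast path
with engine.begin() as conn:
    conn.execute(my_table.insert(), rows)
```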