Improving performance when writing from pandas to sqlite

Time: 2018-07-30 21:00:14

Tags: python pandas sqlite cython

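I'm hoping for some pointers on how to optimise this code. Ideally I'd like to keep using pandas, but I assume there are some nifty sqlite tricks that could speed things up. For bonus points, I'd also like to know whether Cython could help here.

In case it isn't clear from the code: for context, I have to read millions of very small sqlite files (the files in "unCompressedDir") and write their contents into a much larger "master" sqlite DB ("6thJan.db").

Thanks in advance, everyone!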

%%cython -a

import os
import pandas as pd
import sqlite3
import time
import sys

def main():

    rootDir = "/Users/harryrobinson/Desktop/dataForMartin/"
    unCompressedDir = "/Users/harryrobinson/Desktop/dataForMartin/unCompressedSqlFiles/"

    with sqlite3.connect(rootDir+'6thJan.db') as conn:

        destCursor = conn.cursor()

        createTable = "CREATE TABLE IF NOT EXISTS userData(TimeStamp, Category, Action, Parameter1Name, Parameter1Value, Parameter2Name, Parameter2Value, formatVersion, appVersion, userID, operatingSystem)"
        destCursor.execute(createTable)


    for i in os.listdir(unCompressedDir):

        try:
            with sqlite3.connect(unCompressedDir+i) as connection:
                cursor = connection.cursor()
                cursor.execute('SELECT * FROM Events')
                df_events = pd.DataFrame(cursor.fetchall())
                cursor.execute('SELECT * FROM Global')
                df_global = pd.DataFrame(cursor.fetchall())

                cols = ['TimeStamp', 'Category', 'Action', 'Parameter1Name', 'Parameter1Value', 'Parameter2Name', 'Parameter2Value']
                df_events = df_events.drop(0,axis=1)
                df_events.columns = cols

                df_events['formatVersion'] = df_global.iloc[0,0]
                df_events['appVersion'] = df_global.iloc[0,1]
                df_events['userID'] = df_global.iloc[0,2]
                df_events['operatingSystem'] = df_global.iloc[0,3]

        except Exception as e:
            print(e, sys.exc_info()[-1].tb_lineno)

        try:
            df_events.to_sql("userData", conn, if_exists="append", index=False)
        except Exception as e:
            print("Sqlite error, {0} - line {1}".format(e, sys.exc_info()[-1].tb_lineno))

Update: I reduced the run time by inserting inside explicit transactions instead of calling to_sql.
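The update does not include the code. As a rough sketch of what batching the inserts into one transaction per source file might look like, the snippet below uses `executemany` inside a `with conn:` block so that each batch is committed once. The helper names `create_user_data` and `append_rows`, and the sample rows, are made up for illustration; only the table and column names come from the question.

```python
import sqlite3

# Column order taken from the CREATE TABLE statement in the question.
COLUMNS = [
    "TimeStamp", "Category", "Action",
    "Parameter1Name", "Parameter1Value",
    "Parameter2Name", "Parameter2Value",
    "formatVersion", "appVersion", "userID", "operatingSystem",
]


def create_user_data(conn):
    """Create the destination table if it does not exist yet (hypothetical helper)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS userData({})".format(", ".join(COLUMNS))
    )


def append_rows(conn, rows):
    """Insert all rows in a single transaction (hypothetical helper)."""
    placeholders = ", ".join("?" for _ in COLUMNS)
    sql = "INSERT INTO userData VALUES ({})".format(placeholders)
    # "with conn" commits once if the block succeeds and rolls back on error.
    with conn:
        conn.executemany(sql, rows)


# Demo with an in-memory database and invented sample data.
conn = sqlite3.connect(":memory:")
create_user_data(conn)

events = [
    ("2018-01-06 10:00:00", "ui", "click", "button", "ok", "screen", "home"),
    ("2018-01-06 10:00:05", "ui", "scroll", "dir", "down", "screen", "home"),
]
global_row = ("1", "2.3", "user-42", "macOS")

# Each event tuple is extended with the per-file global values.
append_rows(conn, [event + global_row for event in events])
row_count = conn.execute("SELECT COUNT(*) FROM userData").fetchone()[0]
```

Here the rows from `cursor.fetchall()` would be passed straight to `append_rows`, which avoids building a DataFrame for every small file.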

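1 Answer:

Answer 0 (score: 0)

Reconsider using pandas as a staging tool and keep the library for data analysis. With plain SQL you can use SQLite's ATTACH to query the external databases directly, so the rows never have to pass through Python.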

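# Assumes os and sqlite3 are imported and rootDir / unCompressedDir are defined as in the question.
with sqlite3.connect(os.path.join(rootDir, '6thJan.db')) as conn:

    destCursor = conn.cursor()

    createTable = """CREATE TABLE IF NOT EXISTS userData(
                        TimeStamp TEXT, Category TEXT, Action TEXT, Parameter1Name TEXT,
                        Parameter1Value TEXT, Parameter2Name TEXT, Parameter2Value TEXT,
                        formatVersion TEXT, appVersion TEXT, userID TEXT, operatingSystem TEXT
                     );"""

    destCursor.execute(createTable)
    conn.commit()

    # e.* must yield exactly the seven event columns; if Events has a leading
    # id column (the question drops column 0), list the columns explicitly.
    # The g.* column names are assumed to match the Global table.
    sql = """INSERT INTO userData
             SELECT e.*, g.formatVersion, g.appVersion, g.userID, g.operatingSystem
             FROM curr_db.[events] e
             CROSS JOIN (SELECT * FROM curr_db.[global] LIMIT 1) g;"""

    for i in os.listdir(unCompressedDir):

        # Bind the full path as a one-element tuple; a bare string would be
        # treated as a sequence of single-character parameters.
        destCursor.execute("ATTACH ? AS curr_db;", (os.path.join(unCompressedDir, i),))

        destCursor.execute(sql)
        conn.commit()  # DETACH fails while a transaction is open

        destCursor.execute("DETACH curr_db;")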
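
To try the ATTACH approach without touching real data, here is a self-contained sketch that builds two throwaway source files in a temporary directory and merges them. The source schema is an assumption inferred from the question (an Events table with a leading id column plus seven event columns, and a Global table with four columns); the helpers `make_source` and `merge` and all sample values are hypothetical. Unlike `e.*`, the SELECT lists the event columns explicitly so the id column is skipped.

```python
import os
import sqlite3
import tempfile

EVENT_COLS = ["TimeStamp", "Category", "Action", "Parameter1Name",
              "Parameter1Value", "Parameter2Name", "Parameter2Value"]
GLOBAL_COLS = ["formatVersion", "appVersion", "userID", "operatingSystem"]


def make_source(path, user_id, n_events):
    """Build a tiny source file with the assumed Events/Global schema."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE Events(id INTEGER PRIMARY KEY, {})".format(", ".join(EVENT_COLS))
    )
    conn.execute("CREATE TABLE Global({})".format(", ".join(GLOBAL_COLS)))
    conn.executemany(
        "INSERT INTO Events VALUES (NULL, ?, 'ui', 'click', 'p1', 'v1', 'p2', 'v2')",
        [("t{}".format(k),) for k in range(n_events)],
    )
    conn.execute("INSERT INTO Global VALUES ('1', '2.3', ?, 'macOS')", (user_id,))
    conn.commit()
    conn.close()


def merge(master_path, source_paths):
    """Append every source file's events to userData via ATTACH."""
    select_list = ", ".join(
        ["e." + c for c in EVENT_COLS] + ["g." + c for c in GLOBAL_COLS]
    )
    insert_sql = (
        "INSERT INTO userData "
        "SELECT {} FROM curr_db.Events AS e "
        "CROSS JOIN (SELECT * FROM curr_db.Global LIMIT 1) AS g"
    ).format(select_list)

    conn = sqlite3.connect(master_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS userData({})".format(
            ", ".join(EVENT_COLS + GLOBAL_COLS)
        )
    )
    for path in source_paths:
        # The path is bound as a parameter, passed as a one-element tuple.
        conn.execute("ATTACH DATABASE ? AS curr_db", (path,))
        conn.execute(insert_sql)
        conn.commit()  # close the transaction before detaching
        conn.execute("DETACH DATABASE curr_db")
    conn.close()


with tempfile.TemporaryDirectory() as tmp:
    sources = [os.path.join(tmp, "a.db"), os.path.join(tmp, "b.db")]
    make_source(sources[0], "user-a", 2)
    make_source(sources[1], "user-b", 3)

    master = os.path.join(tmp, "master.db")
    merge(master, sources)

    check = sqlite3.connect(master)
    total_rows = check.execute("SELECT COUNT(*) FROM userData").fetchone()[0]
    user_ids = [r[0] for r in check.execute(
        "SELECT DISTINCT userID FROM userData ORDER BY userID")]
    check.close()
```

Committing once per file keeps the sketch simple. With millions of files it may be worth measuring whether the per-file commit dominates the run time.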