我需要使用python在SQLite中创建crsp.dsf的副本 crsp.dsf是一个非常大的表,我建议进行流式读取或编写一个按日期或permno对crsp.dsf进行迭代的循环。最后,我应该验证我是否具有相同数量的观测值,唯一的日期,唯一的行和(大约)sum(round(ret,5))。
我的代码如下:
# import packages
import pandas as pd
import numpy as np
import mysql.connector
from dplython import (DplyFrame, X, diamonds, select, sift,
sample_n, sample_frac, head, arrange, mutate, group_by,
summarize, DelayFunction)
# def functions
def connect(db="crsp"):
conn = mysql.connector.connect(host="178.128.92.75",
port=3306,user="student_temp_user",
passwd="temp_userYxL",
db=db)
return conn
def queryMySQL(x,db='crsp'):
conn=connect(db=db)
response = pd.read_sql(x, conn)
conn.close()
return response
# write a loop that iterates on crsp.dsf by permno
permnos = queryMySQL('select distinct permno from crsp.msf').loc[:,'permno']
permnos.unique()
i=0
for item in permnos.unique():
result=queryMySQL('select * ' +
'from crsp.msf where permno = ' + str(item) + ' limit 5')
i=i+1
if i==1:
msf=pd.DataFrame(result)
if(i>1):
msf = msf.append(pd.DataFrame(result))
if i==83436:
break # take long time
msf
# write my own SQL table
from sqlalchemy import create_engine
engine = create_engine("mysql+pymysql://student_temp_user:temp_userYxL@178.128.92.75/student_temp_user")
msf.to_sql(name='msf_copy',con=engine, if_exists='append', index=False)
# verification
print(queryMySQL('select count(*), count(distinct date), count(distinct permno), sum(round(ret,5)) from student_temp_user.msf_copy')
print(queryMySQL('select count(*), count(distinct date), count(distinct permno), sum(round(ret,5)) from crsp.msf')
在循环中,我超过8万人,我想这就是为什么要花这么长时间的原因。如何改进代码?也许我的代码有问题。