I want to continually append DataFrame rows to a MySQL database while preventing any duplicate entries from getting in.
I currently do this by iterating over every row with df.apply() and calling MySQL's INSERT IGNORE to add only unique rows to the database. However, pandas.apply is very slow (~45 seconds per 10k rows). I would like to use the pandas.to_sql() method instead, which pushes 10k entries into the database in about 0.5 seconds, but it does not support ignoring duplicates in append mode. Is there an efficient and fast way to achieve this?
Input CSV
Date,Open,High,Low,Close,Volume
1994-01-03,111.7,112.75,111.55,112.65,0
1994-01-04,112.68,113.47,112.2,112.65,0
1994-01-05,112.6,113.63,112.3,113.0,0
1994-01-06,113.02,113.43,112.25,112.62,0
1994-01-07,112.55,112.8,111.5,111.88,0
1994-01-10,111.8,112.43,111.35,112.25,0
1994-01-11,112.18,112.88,112.05,112.4,0
1994-01-12,112.38,112.82,111.95,112.28,0
Code
nifty_data.to_sql(name='eod_data', con=engine, if_exists='append', index=False)  # option-1

nifty_data.apply(addToDb, axis=1)  # option-2

def addToDb(row):
    sql = "INSERT IGNORE INTO eod_data (date, open, high, low, close, volume) VALUES (%s, %s, %s, %s, %s, %s)"
    val = (row['Date'], row['Open'], row['High'], row['Low'], row['Close'], row['Volume'])
    mycursor.execute(sql, val)
    mydb.commit()
option-1: doesn't allow INSERT IGNORE (~0.5 secs)
option-2: has to loop through every row and is very slow (~45 secs)
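A middle ground between the two options is to batch all the parameter tuples into a single executemany() call, which keeps the INSERT IGNORE semantics of option-2 while avoiding one round trip per row. A minimal sketch of the idea, demonstrated here against an in-memory SQLite database (whose `INSERT OR IGNORE` plays the role of MySQL's `INSERT IGNORE`; the table name `eod_data` follows the question, and the sample frame is assumed):

```python
import sqlite3
import pandas as pd

# Sample frame mirroring the CSV in the question; note the duplicate date.
df = pd.DataFrame({
    'Date': ['1994-01-03', '1994-01-04', '1994-01-03'],
    'Open': [111.7, 112.68, 111.7],
    'High': [112.75, 113.47, 112.75],
    'Low': [111.55, 112.2, 111.55],
    'Close': [112.65, 112.65, 112.65],
    'Volume': [0, 0, 0],
})

conn = sqlite3.connect(':memory:')
conn.execute("""CREATE TABLE eod_data (
    date TEXT PRIMARY KEY, open REAL, high REAL,
    low REAL, close REAL, volume INTEGER)""")

# One executemany() call instead of one execute() + commit() per row.
# On MySQL the statement would read:
#   INSERT IGNORE INTO eod_data (...) VALUES (%s, %s, %s, %s, %s, %s)
sql = ("INSERT OR IGNORE INTO eod_data "
       "(date, open, high, low, close, volume) VALUES (?, ?, ?, ?, ?, ?)")
conn.executemany(sql, df.itertuples(index=False, name=None))
conn.commit()
```

The duplicate 1994-01-03 row is silently skipped because `date` is the primary key; only one commit is issued for the whole batch.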
Answer 0 (score: 3)
You can create a temporary table:
nifty_data.to_sql(name='temporary_table', con=engine, if_exists = 'append', index=False)
Then run an INSERT IGNORE statement from it:
with engine.begin() as cnx:
    insert_sql = 'INSERT IGNORE INTO eod_data (SELECT * FROM temporary_table)'
    cnx.execute(insert_sql)
Just make sure the column order is the same; otherwise you may have to declare the columns explicitly.
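Putting the two steps together, an end-to-end sketch looks like the following. It is demonstrated against an in-memory SQLite engine as a stand-in for the MySQL one (SQLite spells the statement `INSERT OR IGNORE`; on MySQL you would keep `INSERT IGNORE` as shown in the answer), and the pre-existing row and staging-table name are assumptions for illustration:

```python
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine('sqlite://')  # stand-in for the MySQL engine

# Target table with a primary key on date, plus one pre-existing row.
with engine.begin() as cnx:
    cnx.execute(text("""CREATE TABLE eod_data (
        date TEXT PRIMARY KEY, open REAL, high REAL,
        low REAL, close REAL, volume INTEGER)"""))
    cnx.execute(text(
        "INSERT INTO eod_data VALUES ('1994-01-03', 111.7, 112.75, 111.55, 112.65, 0)"))

# Incoming frame; 1994-01-03 is already in the table.
nifty_data = pd.DataFrame({
    'Date': ['1994-01-03', '1994-01-04'],
    'Open': [111.7, 112.68],
    'High': [112.75, 113.47],
    'Low': [111.55, 112.2],
    'Close': [112.65, 112.65],
    'Volume': [0, 0],
})

# Step 1: bulk-load the frame into a staging table (the fast to_sql path).
nifty_data.to_sql(name='temporary_table', con=engine, if_exists='append', index=False)

# Step 2: copy into the real table ignoring duplicates, then drop the staging table.
# MySQL equivalent: INSERT IGNORE INTO eod_data (SELECT * FROM temporary_table)
with engine.begin() as cnx:
    cnx.execute(text('INSERT OR IGNORE INTO eod_data SELECT * FROM temporary_table'))
    cnx.execute(text('DROP TABLE temporary_table'))
```

Dropping (or truncating) the staging table after each batch keeps the next to_sql append from re-inserting old rows.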