Speeding up a bulk insert into MySQL with Python

时间:2017-10-03 11:24:19

标签: python mysql csv

I deploy an application that consumes some .csv data. I want to copy that data into a MySQL table. With the help of Stack Overflow users I wrote the following code:

import csv
import MySQLdb

db = MySQLdb.connect(   host = "dbname.description.host.com",
                        user = "user",
                        passwd = "key",
                        db = "dbname")
cursor = db.cursor()

query = 'INSERT INTO table_name(column,column_1,column_2,column_3) VALUES(%s, %s, %s, %s)'

csv_data = csv.reader(file('file_name'))

for row in csv_data:
     cursor.execute(query,row)
     db.commit()

cursor.close()

The problem is that this is currently far too slow, and I need to speed it up.

THX

7 Answers:

Answer 0 (score: 1)

You can use executemany to batch the job, as shown below.
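A minimal sketch of that approach, assuming the same connection details, table, and CSV layout as in the question (the column names are the question's placeholders):

import csv
import MySQLdb

db = MySQLdb.connect(host="dbname.description.host.com",
                     user="user",
                     passwd="key",
                     db="dbname")
cursor = db.cursor()

query = 'INSERT INTO table_name(column,column_1,column_2,column_3) VALUES(%s, %s, %s, %s)'

# read every CSV row into memory as a tuple of values
with open('file_name') as f:
    rows = [tuple(row) for row in csv.reader(f)]

# one batched call and one commit instead of one round trip per row
cursor.executemany(query, rows)
db.commit()
cursor.close()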

Answer 1 (score: 0)

Commit outside the loop:

for row in csv_data:
     cursor.execute(query,row)
db.commit()

This reduces the amount of work and will be faster.

Answer 2 (score: 0)

The code you are using is very inefficient for several reasons: you commit each row one at a time (which may be what you want for a transactional database or process) rather than dumping everything in one go.

There are many ways to speed this up, ranging from great to not so great. Here are four approaches, including the naive implementation (above).


The odo approach is the fastest (it uses MySQL's LOAD DATA INFILE under the hood). Pandas is next (its critical code paths are optimized). Then comes using a raw cursor but inserting rows in batches. Finally the naive approach, committing one row at a time.

Here are some sample timings, run locally against a local MySQL server.

using_odo (./test.py:29): 0.516 seconds
using_pandas (./test.py:23): 3.039 seconds
using_cursor_correct (./test.py:50): 12.847 seconds
using_cursor (./test.py:34): 43.470 seconds

Count of table1 - 100000
Count of table2 - 100000
Count of table3 - 100000
Count of table4 - 100000

As you can see, the naive implementation is roughly 100x slower than odo and about 10x slower than pandas.
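A rough sketch of the odo approach under the same assumptions as the question (the connection details, table, and file names are the question's placeholders, and a SQLAlchemy-compatible MySQL driver is assumed to be installed):

from odo import odo

# odo bulk-loads the CSV, using MySQL's LOAD DATA INFILE under the hood
odo('file_name.csv',
    'mysql+mysqldb://user:key@dbname.description.host.com/dbname::table_name')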

Answer 3 (score: 0)

The solution is to use a MySQL batch insert, i.e. a single multi-row INSERT statement.

So you need to collect all the values you want to insert and turn them into a single string that you pass as the argument to the execute() method.

In the end, the SQL should look like this:

INSERT INTO table_name (`column`, `column_1`, `column_2`, `column_3`) VALUES('1','2','3','4'),('4','5','6','7'),('7','8','9','10');

Here is an example:

#function to transform one row (a list of values) into a string
def stringify(v):
    return "('%s', '%s', '%s', '%s')" % (v[0], v[1], v[2], v[3])

#transform all rows to strings
v = map(stringify, csv_data)

#glue them together
batchData = ", ".join(e for e in v)

#complete the SQL
sql = "INSERT INTO `table_name`(`column`, `column_1`, `column_2`, `column_3`) \
VALUES %s" % batchData

#execute it
cursor.execute(sql)
db.commit()

Answer 4 (score: 0)

Here are some stats to back up @Mung Tung's answer.

Basic machine configuration:

2.7 GHz Dual-Core Intel Core i5
16 GB 1867 MHz DDR3
Flash Storage

Results: plain execute barely manages about 315 inserts per second, whereas executemany reaches about 25,000 inserts per second.
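A minimal sketch of the kind of comparison behind those numbers (the connection details, table, columns, and row count are placeholders; the table is assumed to have no unique constraints, and the exact figures will vary by machine):

import time
import MySQLdb

db = MySQLdb.connect(host="localhost", user="user", passwd="key", db="dbname")
cursor = db.cursor()

query = 'INSERT INTO table_name(column, column_1) VALUES (%s, %s)'
rows = [(i, 'value') for i in range(10000)]  # dummy data

# one statement per row
start = time.time()
for row in rows:
    cursor.execute(query, row)
db.commit()
print('execute:     %.0f inserts/sec' % (len(rows) / (time.time() - start)))

# one batched call
start = time.time()
cursor.executemany(query, rows)
db.commit()
print('executemany: %.0f inserts/sec' % (len(rows) / (time.time() - start)))

cursor.close()
db.close()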

Answer 5 (score: 0)

I solved this by building an array of tuples and passing it to the execute statement. Processing 1 million rows took only 8 minutes. Try to avoid iterating over conn.execute calls wherever possible.

import pandas as pd

def process_csv_file4(csv_file, conn):
    # read the whole CSV into a DataFrame
    df = pd.read_csv(csv_file, sep=';', names=['column'])

    query = """
            INSERT INTO table
                (column)
            VALUES
                (%s)
            ON DUPLICATE KEY UPDATE
                column = VALUES(column);
            """

    # pass all rows in one call instead of iterating conn.execute
    conn.execute(query, [tuple(row) for row in df.values])

Answer 6 (score: 0)

I use the SQLAlchemy library in a Python script to speed up the bulk insert from a CSV file into a MySQL database. The data is inserted as text, so connect to the database in Workbench and change the column data types afterwards, and the data is ready to use.

Step 1: Run "pip install sqlalchemy" and "pip install mysqlclient" in a command terminal.

import MySQLdb

import pandas as pd

import sqlalchemy

from sqlalchemy import create_engine

Step 2:
Then create a connection string and build an engine with SQLAlchemy's create_engine.

######Create Engine####

# syntax:
# engine = create_engine("mysql+mysqldb://username:password@hostaddress:3306/databasename")

# example:
engine = create_engine("mysql+mysqldb://abc9:abc$123456@127.10.23.1:2207/abc9")
conn = engine.connect()

print(engine)

###########Define your python code##############

def function_name():
    data = pd.read_csv('filepath/file.csv')
    # to_sql writes the whole DataFrame in multi-row batches;
    # the first argument is the target table name
    data.to_sql('table_name', engine, method='multi', index=False,
                if_exists='replace')

############Close Connection###############

conn.close()

Run the code, and 2 million rows can be inserted in 4 minutes!!

Use this reference link for the different database drivers:

https://overiq.com/sqlalchemy-101/installing-sqlalchemy-and-connecting-to-database/