Question

I am trying to process a text file larger than 1GB and save the data into a MySQL database using Python.

I have pasted some sample code below:

import os
import MySQLdb as mdb

conn = mdb.connect(user='root', passwd='redhat', db='Xml_Data', host='localhost', charset="utf8")

file_path = "/home/local/user/Main/Module-1.0.4/file_processing/part-00000.txt"

file_open = open('part-00000','r')

for line in file_open:
    result_words = line.split('\t')
    query = "insert into PerformaceReport (campaignID, keywordID, keyword, avgPosition)"
    query += " VALUES (%s,%s,'%s',%s) " % (result_words[0],result_words[1],result_words[2],result_words[3])
    cursor = conn.cursor()
    cursor.execute( query )
    conn.commit()

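The actual insert has more than 18 columns; I have pasted only four here as an example.

When I run the code above, it takes several hours to execute.

My questions are:

Is there an alternative way to process a 1GB text file very quickly in Python?
Is there a framework that can process a 1GB text file and save the data into the database quickly?
Is it possible to process a large (1GB) text file and save the data into the database within minutes, and if so, how? My concern is that we need to process the 1GB file as quickly as possible, not over several hours.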

Edited code

query += " VALUES (%s,%s,'%s',%s) " % (int(result_words[0] if result_words[0] != '' else ''),int(result_words[2] if result_words[2] != '' else ''),result_words[3] if result_words[3] != '' else '',result_words[4] if result_words[4] != '' else '')

I am actually submitting the values in the format above (checking that each result exists).

Answer 1

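This is a bit of a guess, but I would say that calling conn.commit() for every line of the file makes a big difference. Try moving it out of the loop. You also don't need to recreate the cursor on every iteration of the loop; just create it once before the loop.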
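A minimal sketch of that suggestion, assuming `conn` is any DB-API connection such as the MySQLdb one from the question. The function name `insert_lines` is hypothetical, and the sketch also passes the values as parameters instead of formatting them into the SQL string, which is a change beyond what this answer proposes:

```python
def insert_lines(conn, lines, query):
    """Insert one row per tab-separated line, committing once at the end.

    conn  : an open DB-API connection (e.g. from MySQLdb.connect)
    lines : any iterable of text lines, such as an open file object
    query : an INSERT statement with four %s placeholders
    """
    cursor = conn.cursor()  # created once, outside the loop
    count = 0
    for line in lines:
        # Strip the trailing newline, then take the first four fields
        fields = line.rstrip('\n').split('\t')[:4]
        cursor.execute(query, fields)  # the driver quotes the values
        count += 1
    conn.commit()  # one commit for the whole file instead of one per row
    return count
```

With the question's table this could be called as `insert_lines(conn, open('part-00000'), "insert into PerformaceReport (campaignID, keywordID, keyword, avgPosition) VALUES (%s,%s,%s,%s)")`. A single transaction over a 1GB file can be large, so committing every few thousand rows may be a reasonable middle ground.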

Answer 2

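In addition to what Tim said, I would also look at MySQL's LOAD DATA INFILE. Do any necessary preprocessing in Python and write the result to a separate file that MySQL can access, then execute the appropriate query and let MySQL do the loading.

Alternatively, rewrite the Python code the way it should be (you should pass parameters as values rather than doing string manipulation, which is open to SQL injection attacks):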
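A rough sketch of the LOAD DATA INFILE route, under several assumptions: the helper names `preprocess` and `bulk_load` are hypothetical, only the four columns from the question are loaded, fields contain no backslashes, short lines are simply skipped, and the connection was opened with `local_infile=1` so that the `LOCAL` variant is permitted by both client and server. Empty fields are written as `\N`, which LOAD DATA reads as NULL under its default escaping:

```python
def preprocess(lines, out_path, n_fields=4):
    """Write the first n_fields of each tab-separated line to out_path."""
    written = 0
    with open(out_path, 'w', encoding='utf-8', newline='\n') as out:
        for line in lines:
            fields = line.rstrip('\r\n').split('\t')[:n_fields]
            if len(fields) < n_fields:
                continue  # assumed policy: skip malformed short lines
            # An empty field becomes \N, which LOAD DATA treats as NULL
            cleaned = [f if f != '' else r'\N' for f in fields]
            out.write('\t'.join(cleaned) + '\n')
            written += 1
    return written


# The file name is passed as a parameter; MySQLdb quotes it client-side.
LOAD_SQL = (
    "LOAD DATA LOCAL INFILE %s INTO TABLE PerformaceReport "
    "FIELDS TERMINATED BY '\\t' LINES TERMINATED BY '\\n' "
    "(campaignID, keywordID, keyword, avgPosition)"
)


def bulk_load(conn, path):
    """Ask MySQL to load the preprocessed file in a single statement."""
    cursor = conn.cursor()
    cursor.execute(LOAD_SQL, (path,))
    conn.commit()
```

Usage would be along the lines of `preprocess(open('part-00000'), '/tmp/clean.tsv')` followed by `bulk_load(conn, '/tmp/clean.tsv')`, where the output path is only an example.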

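query = 'insert into something(a, b, c, d) values(%s, %s, %s, %s)'
cursor = conn.cursor()  # conn: the open MySQLdb connection from the question
with open('file.tab') as fin:
    # Lazily yield the first four tab-separated fields of each line
    values = (row.rstrip('\n').split('\t')[:4] for row in fin)
    cursor.executemany(query, values)
conn.commit()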
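To illustrate why the string-formatted query in the question is fragile, here is a small demonstration that needs no database. The helper name `build_unsafe` and the sample keyword are made up for the example; the SQL template is the one from the question:

```python
def build_unsafe(fields):
    """Build the INSERT the way the question does, by string formatting."""
    return ("insert into PerformaceReport (campaignID, keywordID, keyword, avgPosition)"
            " VALUES (%s,%s,'%s',%s) " % tuple(fields))


# A keyword containing an apostrophe, which is plausible in real data
fields = ["1", "2", "men's shoes", "3.5"]
unsafe = build_unsafe(fields)

# The apostrophe ends the SQL string literal early, so the statement is
# malformed at best and an injection vector at worst.
print(unsafe)
```

Passing the same list as the second argument to `cursor.execute` or `cursor.executemany` leaves the quoting to the driver, so the apostrophe is escaped correctly.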

Answer 3

import os
import MySQLdb as mdb
import csv

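def read_file():
    """Yield lists of roughly 500 parsed rows so each batch is inserted at once."""
    file_path = "/home/local/user/Main/Module-1.0.4/file_processing/part-00000.txt"
    with open('part-00000','r') as infile:
        file_open= csv.reader(infile, delimiter='\t')
        cache = []
        for line in file_open:
            cache.append(line[:4])  # the query below has four placeholders
            if len(cache) > 500:
                yield cache
                cache = []
        if cache:  # flush the final partial batch, skipping an empty one
            yield cache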

conn = mdb.connect(user='root', passwd='redhat', db='Xml_Data', host='localhost', charset="utf8")
cursor = conn.cursor()
query = "insert into PerformaceReport (campaignID, keywordID, keyword, avgPosition) VALUES (%s,%s,%s,%s)"
for rows in read_file():
    try:
        cursor.executemany(query, rows)
    except mdb.Error:
        conn.rollback()
    else:
        conn.commit()

The code is untested and may contain minor errors, but it should be faster, though not as fast as using LOAD DATA INFILE.

How to process a 1GB text file with Python
