Question

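I am writing a script that creates a hash of some data and saves it in a database. The data comes from an SQL query that joins a table of about 300k rows with one of about 500k rows. While parsing the results, I create the hash value and update the database using a second connection handler (using the first connection handler gave me an "unread result" error).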

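After a lot of investigation, I found that the following gives me the best performance:

Restart the select query every x iterations. Otherwise, the updates become much slower after a while.
Commit only every 200 queries instead of committing after every query.

The tables used by the select query are MyISAM and are indexed on their primary keys and on the fields used in the joins.
My hash table is InnoDB and only has an index on the primary key (id).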

Here is my script:

commit = ''       
stillgoing = True    
limit1 = 0
limit2 = 50000    
i = 0    
while stillgoing:    
    j = 0    
    # rerun select query every 50000 results
    getProductsQuery = ("SELECT distinct(p.id), p.desc, p.macode, p.manuf, "
        "u.unit, p.weight, p.number, att1.attr as attribute1, p.vcode, att2.attr as attribute2 "
        "FROM p "
        "LEFT JOIN att1 on p.id = att1.attid and att1.attrkey = 'PARAM' "
        "LEFT JOIN att2 on p.id = att2.attid and att2.attrkey = 'NODE' "
        "LEFT JOIN u on p.id = u.umid and u.lang = 'EN' "
        "limit "+str(limit1)+", "+str(limit2))                           
    db.query(getProductsQuery)
    row = db.fetchone()              
    while row is not None:
        i += 1
        j += 1
        id = str(row[0])
        # create hash value
        to_hash = '.'.join( [ helper.tostr(s) for s in row[1:]] )
        hash = hashlib.md5(to_hash.encode('utf-8')).hexdigest()
        # set query
        updQuery = ("update hashtable set hash='"+hash+"' where id="+id+" limit 1" )         
        # commit every 200 queries
        commit = 'no'
        if (i%200==0):
            i = 0
            commit = 'yes'
        # db2 is a second instance of the db connection
        # home-made db connection class
        # query function takes two parameters: query, boolean for commit
        db2.query(updQuery,commit)            
        row = db.fetchone()        
    if commit == 'no':
        db2.cnx.commit()            
    if j < limit2:
        stillgoing = False
    else:
        limit1 += limit2

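Currently, the script takes between 1 hour 30 minutes and 2 hours to run completely. This is the best performance I have obtained since the first version of the script. Is there anything I can do to make it run faster?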

Answer 1

I think you should be able to do this entirely within MySQL:

{{1}}

Answer 2

... LIMIT 0,200  -- touches 200 rows
... LIMIT 200,200  -- touches 400 rows
... LIMIT 400,200  -- touches 600 rows
... LIMIT 600,200  -- touches 800 rows
...

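Get the picture? LIMIT + OFFSET is O(N*N): every chunk has to step over all the rows before its offset, so the total work grows quadratically.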
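To put numbers on that, here is a small sketch that only does the arithmetic; it does not talk to a database. The function name and the 500,000-row figure are assumptions made for illustration (the size of your joined result is not stated), and the cost model is simply "the server steps over `offset` rows before returning the chunk".

```python
def rows_touched_with_offset(total_rows: int, chunk: int) -> int:
    """Rows read in total when paging through a result with LIMIT offset, chunk."""
    touched = 0
    for offset in range(0, total_rows, chunk):
        # The skipped rows are read and discarded, then up to `chunk` rows are returned.
        touched += offset + min(chunk, total_rows - offset)
    return touched


if __name__ == "__main__":
    # The four chunks listed above: 200 + 400 + 600 + 800 rows.
    print(rows_touched_with_offset(800, 200))         # 2000
    # Hypothetical 500,000-row result paged 50,000 rows at a time, as in the question.
    print(rows_touched_with_offset(500_000, 50_000))  # 2750000
```

A single pass over the same hypothetical result would read 500,000 rows, so even with only ten large chunks the offset approach reads several times more than necessary, and the gap widens as the chunks get smaller.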

To get it down to O(N), you need to perform a single linear scan. If a single query (without LIMIT / OFFSET) takes too long, then walk through the table in 'chunks':

... WHERE id BETWEEN 1 AND 200  -- 200 rows
... WHERE id BETWEEN 201 AND 400  -- 200 rows
... WHERE id BETWEEN 401 AND 600  -- 200 rows
... WHERE id BETWEEN 601 AND 800  -- 200 rows

I blogged about this here. If the table you are updating is InnoDB and has PRIMARY KEY(id), then chunking by id is very efficient.
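A minimal sketch of that chunked walk in Python follows. It uses the standard-library `sqlite3` module purely as a stand-in so that it runs anywhere; the table layout (`p` with hypothetical columns `descr` and `weight`), the helper names, and the use of `str()` in place of your `helper.tostr` are all assumptions, not your actual schema. With a MySQL driver the placeholders would typically be `%s` rather than `?`.

```python
import hashlib
import sqlite3


def make_demo_db(n: int) -> sqlite3.Connection:
    """Build a throwaway in-memory database with n rows (hypothetical schema)."""
    cnx = sqlite3.connect(":memory:")
    cnx.execute("CREATE TABLE p (id INTEGER PRIMARY KEY, descr TEXT, weight INTEGER)")
    cnx.execute("CREATE TABLE hashtable (id INTEGER PRIMARY KEY, hash TEXT)")
    cnx.executemany(
        "INSERT INTO p VALUES (?, ?, ?)",
        [(i, f"item{i}", i * 10) for i in range(1, n + 1)],
    )
    cnx.executemany(
        "INSERT INTO hashtable (id) VALUES (?)", [(i,) for i in range(1, n + 1)]
    )
    cnx.commit()
    return cnx


def update_hashes_in_chunks(cnx: sqlite3.Connection, chunk: int = 200) -> int:
    """Walk p by primary-key ranges and store one MD5 per row in hashtable."""
    lo, hi = cnx.execute("SELECT MIN(id), MAX(id) FROM p").fetchone()
    if lo is None:  # empty table, nothing to do
        return 0
    updated = 0
    for start in range(lo, hi + 1, chunk):
        # Each chunk is located through the primary key, so no rows are skipped over.
        rows = cnx.execute(
            "SELECT id, descr, weight FROM p WHERE id BETWEEN ? AND ?",
            (start, start + chunk - 1),
        ).fetchall()
        params = [
            (hashlib.md5(".".join(str(v) for v in row[1:]).encode("utf-8")).hexdigest(), row[0])
            for row in rows
        ]
        # Parameterized UPDATEs, then one commit per chunk.
        cnx.executemany("UPDATE hashtable SET hash = ? WHERE id = ?", params)
        cnx.commit()
        updated += len(params)
    return updated
```

Calling `update_hashes_in_chunks(make_demo_db(1000), 200)` processes the rows in five key ranges. If the ids have gaps, some ranges simply contain fewer than `chunk` rows, which is harmless. Because each chunk is fetched completely before the updates run, a single connection is enough in this sketch.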

You can set autocommit=1 so that each 200-row UPDATE COMMITs automatically.

Oh, your tables are using the antique engine, MyISAM? Well, it will still run reasonably well.

Script optimization for massive updates
