Question

What is the best/fastest solution for the following task?

Technologies used: MySQL database + Python

I am downloading a data.sql file. Format:

INSERT INTO `temp_table` VALUES (group_id,city_id,zip_code,post_code,earnings,'group_name',votes,'city_name',person_id,'person_name',networth);
INSERT INTO `temp_table` VALUES (group_id,city_id,zip_code,post_code,earnings,'group_name',votes,'city_name',person_id,'person_name',networth);
.
.

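The values are different in every row.

Table structure: http://sqlfiddle.com/#!9/8f10d6

A person can have multiple cities.

A person can be in only one group, or in no group at all.

A group can have multiple persons.

I know which country the data in this .sql file comes from.

I need to split this data into 3 tables. I will update the rows that already exist in the tables and create new rows for the ones that do not. So I came up with two solutions:

Split the values from the file in Python and then, for each line, run 3x SELECT + 3x UPDATE/INSERT in a transaction.
Somehow bulk-insert the data into a temporary table and then process it inside the database. That means that for each row in the temporary table I would run 3 SELECT queries (one per real table), and then, depending on whether a row is found, 3 UPDATE queries or, if not, 3 INSERT queries.

I will run this function several times a day with more than 10,000 rows in the .sql file, and it will update/create more than 30,000 rows in the database.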

// EDIT

This is my current insert/update code:

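# Turn off autocommit so the inserts are not committed one by one
autocommit = "SET autocommit=0"
with connection.cursor() as cursor:
    cursor.execute(autocommit)

# Read the downloaded dump and run each INSERT statement against temp_table
with open("data.sql") as f:
    lines = f.read().splitlines()
for line in lines:
    with connection.cursor() as cursor:
        cursor.execute(line)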

temp_data = "SELECT * FROM temp_table"
with connection.cursor() as cursor: 
    cursor.execute(temp_data)
    temp_data = cursor.fetchall()

for temp_row in temp_data:
    group_id = temp_row[0]
    city_id = temp_row[1]
    zip_code = temp_row[2]
    post_code = temp_row[3]
    earnings = temp_row[4]
    group_name = temp_row[5]
    votes = temp_row[6]
    city_name = temp_row[7]
    person_id = temp_row[8]
    person_name = temp_row[9]
    networth = temp_row[10]

    # Check whether the group already exists for this country
    group_select = "SELECT * FROM perm_group WHERE group_id = %s AND countryid_fk = %s"
    group_values = (group_id, countryid)
    with connection.cursor() as cursor:
        row = cursor.execute(group_select, group_values)
    if row == 0 and group_id != 0:  # If the person has no group, do not create one
        group_insert = "INSERT INTO perm_group (group_id, group_name, countryid_fk) VALUES (%s, %s, %s)"
        group_insert_values = (group_id, group_name, countryid)
        with connection.cursor() as cursor:
            cursor.execute(group_insert, group_insert_values)
            groupid = cursor.lastrowid
    elif row == 1 and group_id != 0:
        # Pass the new name as a parameter; "SET group_name = group_name" would change nothing
        group_update = "UPDATE perm_group SET group_name = %s WHERE group_id = %s AND countryid_fk = %s"
        group_update_values = (group_name, group_id, countryid)
        with connection.cursor() as cursor:
            cursor.execute(group_update, group_update_values)

        # Select the group id for the current row to assign the correct group to the person
        group_certain_select = "SELECT id FROM perm_group WHERE group_id = %s AND countryid_fk = %s"
        group_certain_select_values = (group_id, countryid)
        with connection.cursor() as cursor:
            cursor.execute(group_certain_select, group_certain_select_values)
            groupid = cursor.fetchone()
    # .
    # .
    # .
    # The same piece of code is repeated for person and city

Measured time: 206 seconds, which is unacceptable.

    group_insert = "INSERT INTO perm_group (group_id, group_name, countryid_fk) VALUES (%s, %s, %s) ON DUPLICATE KEY UPDATE group_id = %s, group_name = %s"
    group_insert_values = (group_id, group_name, countryid, group_id, group_name)
    with connection.cursor() as cursor:
        cursor.execute(group_insert, group_insert_values)

    # Select the group id for the current row to assign the correct group to the person
    group_certain_select = "SELECT id FROM perm_group WHERE group_id = %s AND countryid_fk = %s"
    group_certain_select_values = (group_id, countryid)
    with connection.cursor() as cursor:
        cursor.execute(group_certain_select, group_certain_select_values)
        groupid = cursor.fetchone()

Measured time: 30 to 50 seconds. (Still quite long, but it is getting better.)

Are there any other, better (faster) ways to do this?

Thank you, Popcorn

Answer 1

I would suggest that you load the data into a staging table and then do the processing using SQL.
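As a rough sketch of the loading step, the function below runs every INSERT line of the dump with one cursor and commits once at the end. It assumes a DB-API style connection (as in the question's code) and that `temp_table` may be emptied before each import; `load_staging` is a name made up for illustration.

```python
def load_staging(connection, sql_text):
    """Run all INSERT statements from the dump in a single transaction.

    Returns the number of statements executed.
    """
    # Keep only the INSERT lines; this skips blank lines and other noise.
    statements = [
        line.strip()
        for line in sql_text.splitlines()
        if line.strip().upper().startswith("INSERT INTO")
    ]
    cursor = connection.cursor()
    try:
        # Assumption: the staging table holds nothing worth keeping.
        cursor.execute("DELETE FROM temp_table")
        for statement in statements:
            cursor.execute(statement)
        connection.commit()  # one commit for the whole file
    except Exception:
        connection.rollback()
        raise
    finally:
        cursor.close()
    return len(statements)
```

It would be called as `load_staging(connection, open("data.sql").read())`. Committing once per file rather than once per row is usually where most of the loading time is saved.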

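Basically, the final result is a set of SQL tables, so SQL is necessarily going to be part of the solution. You might as well reduce the number of tools needed by putting as much of the logic as possible in the database.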
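To illustrate what "logic in the database" could look like, here is a minimal sketch that replaces the per-row loop with one set-based statement per target table. It rests on assumptions that are not confirmed by the question: `perm_group` is assumed to have a UNIQUE key on (`group_id`, `countryid_fk`), and the `perm_person` table with its columns and UNIQUE key on (`person_id`, `countryid_fk`) is hypothetical. The `%s` placeholders follow the style of the question's code.

```python
# Assumed: UNIQUE KEY (group_id, countryid_fk) on perm_group.
UPSERT_GROUPS = """
INSERT INTO perm_group (group_id, group_name, countryid_fk)
SELECT DISTINCT t.group_id, t.group_name, %s
FROM temp_table AS t
WHERE t.group_id <> 0
ON DUPLICATE KEY UPDATE group_name = VALUES(group_name)
"""

# Hypothetical table and columns; assumed UNIQUE KEY (person_id, countryid_fk).
# The LEFT JOIN resolves the surrogate group id, leaving NULL for "no group".
UPSERT_PERSONS = """
INSERT INTO perm_person (person_id, person_name, groupid_fk, countryid_fk)
SELECT DISTINCT t.person_id, t.person_name, g.id, %s
FROM temp_table AS t
LEFT JOIN perm_group AS g
  ON g.group_id = t.group_id AND g.countryid_fk = %s
ON DUPLICATE KEY UPDATE
  person_name = VALUES(person_name),
  groupid_fk = VALUES(groupid_fk)
"""


def process_staging(connection, country_id):
    """Upsert groups first, then persons, inside one transaction."""
    cursor = connection.cursor()
    try:
        cursor.execute(UPSERT_GROUPS, (country_id,))
        cursor.execute(UPSERT_PERSONS, (country_id, country_id))
        connection.commit()
    except Exception:
        connection.rollback()
        raise
    finally:
        cursor.close()
```

Groups go first so that the join in the person statement can find the ids it needs. A city statement would follow the same pattern. Note that `VALUES()` inside `ON DUPLICATE KEY UPDATE` is deprecated in recent MySQL 8.0 releases, although it is still accepted.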

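Loading 10,000 rows should not take much time. However, if you have a choice of data formats, I would recommend a CSV file over inserts. The inserts may incur extra overhead (if only because they are larger).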
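If the file can only be obtained as INSERT statements, it could still be converted to CSV on the Python side before loading. The sketch below assumes one single-row INSERT per line in the format shown in the question, with string values in single quotes; `insert_line_to_row` and `dump_to_csv` are illustrative names. The `LOAD DATA LOCAL INFILE` statement additionally assumes that `local_infile` is enabled on both the server and the client.

```python
import csv
import io


def insert_line_to_row(line):
    """Return the VALUES tuple of one single-row INSERT as a list of strings."""
    start = line.index("VALUES") + len("VALUES")
    payload = line[start:].strip().rstrip(";").strip()
    payload = payload[1:-1]  # drop the surrounding parentheses
    # Parse the comma-separated values, treating '...' as quoted strings.
    reader = csv.reader([payload], quotechar="'", escapechar="\\",
                        skipinitialspace=True)
    return next(reader)


def dump_to_csv(sql_text):
    """Convert every INSERT line of the dump into one CSV line."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in sql_text.splitlines():
        if line.lstrip().upper().startswith("INSERT INTO"):
            writer.writerow(insert_line_to_row(line))
    return out.getvalue()


# Executed with the CSV file path as the only parameter.
LOAD_CSV = """
LOAD DATA LOCAL INFILE %s
INTO TABLE temp_table
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\\n'
"""
```

The CSV text would be written to a file and loaded with `cursor.execute(LOAD_CSV, (path,))`. Whether this beats running the inserts in a single transaction for about 10,000 rows is something to measure rather than assume.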

Once the data is in the database, I would not worry much about the processing time needed to store it in the three tables.
