Conditionally update a SQLite database: change column values or append new rows based on a condition

Time: 2017-09-16 22:04:15

Tags: python sqlite pandas

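I need to connect to an existing SQLite database and compare the values of a key column against the values in a dataframe. For every key that matches between the database and the dataframe, update the value of a specific column in that row. If a key exists in the dataframe but not in the database, append the corresponding row to the database. The target is a relatively large dataset, so memory usage and performance are a concern (the db can be 20-60 GB, with ~20 columns and millions of rows).

I previously tried reading the database into a dataframe and merging the old and new dataframes in memory, but that proved expensive (a 5 GB dataset would typically grow to 20 GB in memory).

I'm lost on the logic here. This is the furthest I've gotten: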

def update_column(tablename, key_value):
    c.execute('SELECT key FROM {}'.format(tablename))
    for row in c.fetchall():
        # populating this key value per row is challenging for me
        if row[0] == key_value:  # fetchall() yields tuples, so compare the first field
            # table names cannot be bound as parameters, but values can
            c.execute('UPDATE {} SET last_seen = ? WHERE UUID = ?'.format(tablename),
                      (hunt_date, key_value))
        else:
            df.to_sql(tablename, conn, if_exists='append')

for index, row in reader.iterrows():
    key_value = row['key']
    update_column(tablename, key_value)

Example dataset:

Database

Key       First_Seen Last_Seen Data1  Data2
Bigfoot   2015       2015      Blah   Blah
Loch_Ness 2016       2016      Blah   Blah
UFO       2016       2004      Blah   Blah     

Dataframe containing the new data:

Key       First_Seen Last_Seen Data1 Data2
UFO       2017       2017      Blah  Blah
Tupac     2017       2017      Blah  Blah

Desired output in the database:

Key       First_Seen Last_Seen Data1 Data2
Bigfoot   2015       2015      Blah  Blah
Loch_Ness 2016       2016      Blah  Blah
UFO       2016       2017      Blah  Blah
Tupac     2017       2017      Blah  Blah

2 answers:

Answer 0: (score: 2)

I would do this kind of update on the SQLite side.

First, save the DF as a temporary SQLite table, tmp:

df.to_sql('tmp', conn, if_exists='replace')

sql = """
UPDATE table_name set last_seen = (SELECT t.last_seen
                                   FROM tmp t
                                   WHERE t.Key = table_name.key)
WHERE EXISTS(
    SELECT *
    FROM tmp
    WHERE tmp.key = table_name.key
)
"""

c.execute(sql)
conn.commit()  # persist the update
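
To see the statement in action, here is a minimal, self-contained sketch that uses only the standard-library sqlite3 module and the sample data from the question. The in-memory database, the trimmed three-column schema, and the `executemany` staging step (standing in for `df.to_sql('tmp', conn, if_exists='replace')`) are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # assumption: throwaway in-memory database
c = conn.cursor()

# Existing table, filled with the sample rows from the question.
c.execute("CREATE TABLE table_name (key TEXT, first_seen INT, last_seen INT)")
c.executemany("INSERT INTO table_name VALUES (?, ?, ?)",
              [("Bigfoot", 2015, 2015),
               ("Loch_Ness", 2016, 2016),
               ("UFO", 2016, 2004)])

# Staging table; with pandas, df.to_sql('tmp', conn, if_exists='replace')
# would create and fill it instead.
c.execute("CREATE TABLE tmp (key TEXT, first_seen INT, last_seen INT)")
c.executemany("INSERT INTO tmp VALUES (?, ?, ?)",
              [("UFO", 2017, 2017),
               ("Tupac", 2017, 2017)])

# Correlated subquery pulls last_seen from tmp; the EXISTS guard keeps
# rows without a match in tmp untouched.
c.execute("""
UPDATE table_name SET last_seen = (SELECT t.last_seen
                                   FROM tmp t
                                   WHERE t.key = table_name.key)
WHERE EXISTS(
    SELECT *
    FROM tmp
    WHERE tmp.key = table_name.key
)
""")
conn.commit()

rows = c.execute(
    "SELECT key, first_seen, last_seen FROM table_name ORDER BY key"
).fetchall()
print(rows)
# [('Bigfoot', 2015, 2015), ('Loch_Ness', 2016, 2016), ('UFO', 2016, 2017)]
```

Note that this step only updates existing keys: Tupac is in tmp but is not added to table_name, so appending new keys still needs a separate INSERT.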

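Answer 1: (score: 2)

As suggested, consider a temporary table in SQLite and run UPDATE and INSERT INTO queries. There is no need to iterate over millions of rows.

Because SQLite does not support UPDATE...JOIN, a subquery is required, such as an IN clause. There is no harm in running the append query every time, since it only appends rows whose keys are not already present.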

df.to_sql('pandastable', conn, if_exists='replace')

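# SQLite rejects table-qualified names in SET, and p is not visible there,
# so last_seen is fetched with a correlated subquery instead
c.execute("UPDATE finaltable " + \
          "SET last_seen = (SELECT p.last_seen FROM pandastable p " + \
          "                 WHERE p.[key] = finaltable.[key]) " + \
          "WHERE [key] IN (SELECT p.[key] FROM pandastable p);")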
conn.commit()

c.execute("INSERT INTO finaltable ([key], first_seen, last_seen, blah, blah, blah) " + \
          "SELECT [key], first_seen, last_seen, blah, blah, blah " + \
          "FROM pandastable p " + \
          "WHERE NOT EXISTS " + \
          "   (SELECT 1 FROM finaltable sub " + \
          "    WHERE sub.[key] = p.[key]);")
conn.commit()
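
A minimal sketch of the update-then-append flow on the question's sample data, using only the standard-library sqlite3 module. The function name `sync_from_staging`, the in-memory database, and the trimmed three-column schema are hypothetical, and `executemany` stands in for `df.to_sql('pandastable', conn, if_exists='replace')`. The UPDATE here pulls last_seen through a correlated subquery.

```python
import sqlite3

def sync_from_staging(conn):
    """Update rows whose key is staged, then append keys not yet present."""
    c = conn.cursor()
    c.execute("UPDATE finaltable "
              "SET last_seen = (SELECT p.last_seen FROM pandastable p "
              "                 WHERE p.[key] = finaltable.[key]) "
              "WHERE [key] IN (SELECT [key] FROM pandastable)")
    c.execute("INSERT INTO finaltable ([key], first_seen, last_seen) "
              "SELECT [key], first_seen, last_seen FROM pandastable p "
              "WHERE NOT EXISTS (SELECT 1 FROM finaltable sub "
              "                  WHERE sub.[key] = p.[key])")
    conn.commit()

conn = sqlite3.connect(":memory:")  # assumption: throwaway in-memory database
conn.execute("CREATE TABLE finaltable ([key] TEXT, first_seen INT, last_seen INT)")
conn.execute("CREATE TABLE pandastable ([key] TEXT, first_seen INT, last_seen INT)")
conn.executemany("INSERT INTO finaltable VALUES (?, ?, ?)",
                 [("Bigfoot", 2015, 2015),
                  ("Loch_Ness", 2016, 2016),
                  ("UFO", 2016, 2004)])
conn.executemany("INSERT INTO pandastable VALUES (?, ?, ?)",
                 [("UFO", 2017, 2017),
                  ("Tupac", 2017, 2017)])
conn.commit()

sync_from_staging(conn)
sync_from_staging(conn)  # second run finds nothing new to append

result = conn.execute(
    "SELECT [key], first_seen, last_seen FROM finaltable ORDER BY rowid"
).fetchall()
print(result)
# [('Bigfoot', 2015, 2015), ('Loch_Ness', 2016, 2016),
#  ('UFO', 2016, 2017), ('Tupac', 2017, 2017)]
```

On tables with millions of rows, an index on the key column of both tables (for example via CREATE INDEX) would probably help the subqueries, though that is worth measuring on the real data.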

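If pandas is connected through SQLAlchemy rather than a raw connection, consider running the action queries inside transactions instead of through cursor calls: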

import sqlalchemy

...
engine = sqlalchemy.create_engine("sqlite:////path/to/database.db")

df.to_sql(name='pandastable', con=engine, if_exists='replace')

# SQL ACTIONS USING TRANSACTIONS
# sqlalchemy.text() wraps the raw SQL; SQLAlchemy 2.0 no longer accepts plain strings here
with engine.begin() as conn:
    # correlated subquery: SQLite rejects table-qualified names in SET
    conn.execute(sqlalchemy.text(
        "UPDATE finaltable "
        "SET last_seen = (SELECT p.last_seen FROM pandastable p "
        "                 WHERE p.[key] = finaltable.[key]) "
        "WHERE [key] IN (SELECT p.[key] FROM pandastable p);"))

with engine.begin() as conn:
    conn.execute(sqlalchemy.text(
        "INSERT INTO finaltable ([key], first_seen, last_seen, blah, blah, blah) "
        "SELECT [key], first_seen, last_seen, blah, blah, blah "
        "FROM pandastable p "
        "WHERE NOT EXISTS "
        "   (SELECT 1 FROM finaltable sub "
        "    WHERE sub.[key] = p.[key]);"))

engine.dispose()
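
The transaction idea can also be sketched with the standard-library sqlite3 module alone, where `with conn:` commits on success and rolls back on an exception. In this sketch both statements share one transaction, so a failed INSERT also undoes the UPDATE. The helper name `apply_changes`, the NOT NULL constraint used to provoke a failure, and the trimmed schema are all hypothetical.

```python
import sqlite3

UPDATE_SQL = ("UPDATE finaltable "
              "SET last_seen = (SELECT p.last_seen FROM pandastable p "
              "                 WHERE p.[key] = finaltable.[key]) "
              "WHERE [key] IN (SELECT [key] FROM pandastable)")
INSERT_SQL = ("INSERT INTO finaltable ([key], first_seen, last_seen) "
              "SELECT [key], first_seen, last_seen FROM pandastable p "
              "WHERE NOT EXISTS (SELECT 1 FROM finaltable sub "
              "                  WHERE sub.[key] = p.[key])")

def apply_changes(conn):
    # The connection context manager commits if the block succeeds
    # and rolls back if it raises.
    with conn:
        conn.execute(UPDATE_SQL)
        conn.execute(INSERT_SQL)

conn = sqlite3.connect(":memory:")  # assumption: throwaway in-memory database
conn.execute("CREATE TABLE finaltable "
             "([key] TEXT, first_seen INT NOT NULL, last_seen INT)")
conn.execute("CREATE TABLE pandastable ([key] TEXT, first_seen INT, last_seen INT)")
conn.executemany("INSERT INTO finaltable VALUES (?, ?, ?)",
                 [("Bigfoot", 2015, 2015),
                  ("Loch_Ness", 2016, 2016),
                  ("UFO", 2016, 2004)])
# The Tupac row is deliberately broken: NULL first_seen violates NOT NULL.
conn.executemany("INSERT INTO pandastable VALUES (?, ?, ?)",
                 [("UFO", 2017, 2017),
                  ("Tupac", None, 2017)])
conn.commit()

try:
    apply_changes(conn)
except sqlite3.IntegrityError:
    pass  # the INSERT failed, so the UPDATE was rolled back with it

after_failure = conn.execute(
    "SELECT last_seen FROM finaltable WHERE [key] = 'UFO'").fetchone()[0]

# Repair the staged row and retry; this time both statements commit together.
conn.execute("UPDATE pandastable SET first_seen = 2017 WHERE [key] = 'Tupac'")
conn.commit()
apply_changes(conn)

final_rows = conn.execute(
    "SELECT [key], first_seen, last_seen FROM finaltable ORDER BY rowid"
).fetchall()
print(after_failure, final_rows)
```

Whether one combined transaction or two separate ones is preferable depends on whether a partially applied sync is acceptable for the data at hand.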