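I need to connect to an existing SQLite database and compare the values of a key column against the values in a DataFrame. For every key that matches between the database and the DataFrame, update the value of a specific column in that row. If a key exists in the DataFrame but not in the database, append the corresponding row to the database. The target datasets are relatively large, so memory usage and performance are a concern (the database can be 20-60 GB, with ~20 columns and millions of rows).
I previously tried reading the database into a DataFrame and merging the old and new DataFrames in memory, but that proved expensive (a 5 GB dataset typically grew to 20 GB in memory).
I am lost in the logic here; this is the furthest I have gotten: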
def update_column(tablename, key_value):
    c.execute('SELECT key FROM {}'.format(tablename))
    for row in c.fetchall():
        # populating this key value per row is challenging for me
        if row == key_value:
            c.execute('UPDATE {} SET last_seen = {} WHERE UUID = {}'.format(tablename, hunt_date, key_value))
        else:
            df.to_sql(tablename, if_exists='append')

for index, row in reader.iterrows():
    key_value = row['key']
    update_column(tablename, key_value)
Example dataset:
Database:
Key        First_Seen  Last_Seen  Data1  Data2
Bigfoot    2015        2015       Blah   Blah
Loch_Ness  2016        2016       Blah   Blah
UFO        2016        2004       Blah   Blah
DataFrame containing the new data:
Key        First_Seen  Last_Seen  Data1  Data2
UFO        2017        2017       Blah   Blah
Tupac      2017        2017       Blah   Blah
Desired output in the database:
Key        First_Seen  Last_Seen  Data1  Data2
Bigfoot    2015        2015       Blah   Blah
Loch_Ness  2016        2016       Blah   Blah
UFO        2016        2017       Blah   Blah
Tupac      2017        2017       Blah   Blah
Answer 0 (score: 2)
I would do this kind of update on the SQLite side.
First, save the DataFrame as a temporary SQLite table, tmp:
df.to_sql('tmp', conn, if_exists='replace')
sql = """
    UPDATE table_name
    SET last_seen = (SELECT t.last_seen
                     FROM tmp t
                     WHERE t.key = table_name.key)
    WHERE EXISTS (
        SELECT *
        FROM tmp
        WHERE tmp.key = table_name.key
    )
"""
c.execute(sql)
# Persist the update; sqlite3 does not commit DML statements automatically
conn.commit()
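To see why the WHERE EXISTS guard matters, here is a minimal sketch that uses only the standard-library sqlite3 module and the sample rows from the question. The table name `monsters` and the helpers `build_demo_db` and `last_seen_after` are hypothetical names chosen for illustration, and the `tmp` table is filled by hand as a stand-in for `df.to_sql`.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory database holding the question's sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE monsters (key TEXT PRIMARY KEY, first_seen INTEGER, last_seen INTEGER)"
    )
    conn.executemany(
        "INSERT INTO monsters VALUES (?, ?, ?)",
        [("Bigfoot", 2015, 2015), ("Loch_Ness", 2016, 2016), ("UFO", 2016, 2004)],
    )
    # Stand-in for df.to_sql('tmp', conn, if_exists='replace')
    conn.execute("CREATE TABLE tmp (key TEXT, first_seen INTEGER, last_seen INTEGER)")
    conn.executemany(
        "INSERT INTO tmp VALUES (?, ?, ?)",
        [("UFO", 2017, 2017), ("Tupac", 2017, 2017)],
    )
    return conn

# Update restricted to keys that also appear in tmp
GUARDED = """
UPDATE monsters
SET last_seen = (SELECT t.last_seen FROM tmp t WHERE t.key = monsters.key)
WHERE EXISTS (SELECT 1 FROM tmp t WHERE t.key = monsters.key)
"""

# Same statement without the guard: every row of monsters is touched
UNGUARDED = """
UPDATE monsters
SET last_seen = (SELECT t.last_seen FROM tmp t WHERE t.key = monsters.key)
"""

def last_seen_after(sql):
    """Run one UPDATE on a fresh demo database and return {key: last_seen}."""
    conn = build_demo_db()
    conn.execute(sql)
    conn.commit()
    result = dict(conn.execute("SELECT key, last_seen FROM monsters"))
    conn.close()
    return result

guarded = last_seen_after(GUARDED)
unguarded = last_seen_after(UNGUARDED)
print(guarded)
print(unguarded)
```

With the guard, only the UFO row changes. Without it, the scalar subquery finds no match for Bigfoot and Loch_Ness and their last_seen values are overwritten with NULL.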
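Answer 1 (score: 2)
As suggested, consider a temporary table in SQLite and run UPDATE and INSERT INTO queries against it. There is no need to loop through millions of rows.
Because SQLite does not support UPDATE...JOIN, a subquery is needed, for example an IN clause. There is no harm in running the append query every time, since it only appends rows with new keys.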
df.to_sql('pandastable', conn, if_exists='replace')
# SQLite rejects qualified column names in SET, so use a correlated subquery
c.execute("UPDATE finaltable " + \
          "SET last_seen = (SELECT p.last_seen FROM pandastable p " + \
          "                 WHERE p.[key] = finaltable.[key]) " + \
          "WHERE [key] IN (SELECT p.[key] FROM pandastable p);")
conn.commit()
# Append only the rows whose key is not yet in finaltable
c.execute("INSERT INTO finaltable ([key], first_seen, last_seen, blah, blah, blah) " + \
          "SELECT [key], first_seen, last_seen, blah, blah, blah " + \
          "FROM pandastable p " + \
          "WHERE NOT EXISTS " + \
          "   (SELECT 1 FROM finaltable sub " + \
          "    WHERE sub.[key] = p.[key]);")
conn.commit()
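The update-then-append pattern above can be sketched end to end with the standard-library sqlite3 module and the question's sample rows. This is only an illustration under assumptions: the helper `sync_from_staging` is hypothetical, the staging table is filled by hand instead of by `df.to_sql`, only three columns are used, and the index on the staging key is an optional addition that may help the correlated subquery on large tables.

```python
import sqlite3

def sync_from_staging(conn, target="finaltable", staging="pandastable"):
    """Update last_seen for matching keys, then append keys missing from target.

    Table names cannot be bound as SQL parameters, so they are formatted in;
    pass only trusted, hard-coded names.
    """
    with conn:  # commits on success, rolls back if a statement fails
        # Assumption: an index on the staging key speeds up the lookups
        conn.execute(
            f"CREATE INDEX IF NOT EXISTS idx_{staging}_key ON {staging} ([key])"
        )
        conn.execute(
            f"UPDATE {target} "
            f"SET last_seen = (SELECT p.last_seen FROM {staging} p "
            f"                 WHERE p.[key] = {target}.[key]) "
            f"WHERE [key] IN (SELECT [key] FROM {staging})"
        )
        conn.execute(
            f"INSERT INTO {target} ([key], first_seen, last_seen) "
            f"SELECT p.[key], p.first_seen, p.last_seen FROM {staging} p "
            f"WHERE NOT EXISTS (SELECT 1 FROM {target} sub WHERE sub.[key] = p.[key])"
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE finaltable ([key] TEXT, first_seen INTEGER, last_seen INTEGER)")
conn.execute("CREATE TABLE pandastable ([key] TEXT, first_seen INTEGER, last_seen INTEGER)")
conn.executemany(
    "INSERT INTO finaltable VALUES (?, ?, ?)",
    [("Bigfoot", 2015, 2015), ("Loch_Ness", 2016, 2016), ("UFO", 2016, 2004)],
)
conn.executemany(
    "INSERT INTO pandastable VALUES (?, ?, ?)",
    [("UFO", 2017, 2017), ("Tupac", 2017, 2017)],
)

sync_from_staging(conn)
sync_from_staging(conn)  # running it again adds no duplicate rows

rows = conn.execute("SELECT * FROM finaltable ORDER BY [key]").fetchall()
print(rows)
```

Calling the helper twice is deliberate: the second call finds every staging key already present, so the append step inserts nothing.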
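If you connect pandas through SQLAlchemy rather than a raw connection, consider running the action queries inside transactions instead of through cursor calls: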
import sqlalchemy
...
engine = sqlalchemy.create_engine("sqlite:////path/to/database.db")
df.to_sql(name='pandastable', con=engine, if_exists='replace')

# SQL ACTIONS USING TRANSACTIONS
# Raw SQL is wrapped in text(), which SQLAlchemy 2.0 requires
with engine.begin() as conn:
    conn.execute(sqlalchemy.text(
        "UPDATE finaltable " + \
        "SET last_seen = (SELECT p.last_seen FROM pandastable p " + \
        "                 WHERE p.[key] = finaltable.[key]) " + \
        "WHERE [key] IN (SELECT p.[key] FROM pandastable p);"))

with engine.begin() as conn:
    conn.execute(sqlalchemy.text(
        "INSERT INTO finaltable ([key], first_seen, last_seen, blah, blah, blah) " + \
        "SELECT [key], first_seen, last_seen, blah, blah, blah " + \
        "FROM pandastable p " + \
        "WHERE NOT EXISTS " + \
        "   (SELECT 1 FROM finaltable sub " + \
        "    WHERE sub.[key] = p.[key]);"))

engine.dispose()
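As an alternative to the separate UPDATE and INSERT statements, a different technique not used in the answers, SQLite's UPSERT clause can do both in one statement. The sketch below assumes SQLite 3.24 or newer and a PRIMARY KEY (or UNIQUE) constraint on the key column, which the original tables may not have. It uses the standard-library sqlite3 module with hand-filled tables and only three columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: the key column carries a PRIMARY KEY constraint, which
# ON CONFLICT needs in order to detect existing keys
conn.execute(
    "CREATE TABLE finaltable ([key] TEXT PRIMARY KEY, first_seen INTEGER, last_seen INTEGER)"
)
conn.execute(
    "CREATE TABLE pandastable ([key] TEXT, first_seen INTEGER, last_seen INTEGER)"
)
conn.executemany(
    "INSERT INTO finaltable VALUES (?, ?, ?)",
    [("Bigfoot", 2015, 2015), ("Loch_Ness", 2016, 2016), ("UFO", 2016, 2004)],
)
conn.executemany(
    "INSERT INTO pandastable VALUES (?, ?, ?)",
    [("UFO", 2017, 2017), ("Tupac", 2017, 2017)],
)

# "WHERE true" keeps the parser from reading ON as part of a join
# when UPSERT is combined with INSERT ... SELECT
UPSERT_SQL = """
INSERT INTO finaltable ([key], first_seen, last_seen)
SELECT [key], first_seen, last_seen FROM pandastable WHERE true
ON CONFLICT ([key]) DO UPDATE SET last_seen = excluded.last_seen
"""

with conn:  # one transaction for the whole upsert
    conn.execute(UPSERT_SQL)

upserted = conn.execute("SELECT * FROM finaltable ORDER BY [key]").fetchall()
print(upserted)
```

Existing keys keep their first_seen value because only last_seen is listed in DO UPDATE SET; new keys are inserted with all three columns.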