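I'm having a lot of trouble with this problem. I'm trying to compare two tables from two different databases to see which tuples have been added, which have been deleted, and which have been updated. I do this with the following code: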
from sqlalchemy import create_engine

# Query the databases to get all tuples from the relations.
# Save each relation to a list so its tuples can be iterated over multiple times.
# Iterate through the lists and hash each tuple, with k, v being primary key, tuple.
# Iterate through the "after" relation. For each tuple, look up its key in the "before" map.
#   If it is found and the tuple is different, consider that an update; otherwise do nothing.
#   If it is not found, consider that an insert.
# Iterate through the "before" relation. For each tuple, look up its key in the "after" map.
#   If the tuple is found, do nothing.
#   If not, consider that a delete.

dev_engine = create_engine('mysql://...')
prod_engine = create_engine('mysql://...')


def transactions(exchange):
    dev_connect = dev_engine.connect()
    prod_connect = prod_engine.connect()

    get_dev_instrument = "select * from " + exchange + "_instrument;"
    instruments = dev_engine.execute(get_dev_instrument)
    instruments_list = [r for r in instruments]
    print('made instruments_list')

    get_prod_instrument = "select * from " + exchange + "_instrument;"
    instruments_after = prod_engine.execute(get_prod_instrument)
    instruments_after_list = [r2 for r2 in instruments_after]
    print('made instruments_after_list')

    # Build the maps from the saved lists: the result objects were already
    # consumed by the list comprehensions above and would yield no rows here.
    before_map = {}
    after_map = {}
    for row in instruments_list:
        before_map[row['instrument_id']] = row
    for y in instruments_after_list:
        after_map[y['instrument_id']] = y
    print('formed maps')

    update_count = insert_count = delete_count = 0
    change_list = []

    # Pass 1: walk the "after" rows to find updates and inserts.
    for prod_row in instruments_after_list:
        result = list(prod_row)
        try:
            row = before_map[prod_row['instrument_id']]
            if row != prod_row:
                update_count += 1
                for i in range(len(row)):
                    if row[i] != prod_row[i]:
                        result[i] = str(row[i]) + '--->' + str(prod_row[i])
                result.append("updated")
                change_list.append(result)
        except KeyError:
            insert_count += 1
            result.append("inserted")
            change_list.append(result)

    # Pass 2: walk the "before" rows to find deletes.
    for before_row in instruments_list:
        result = list(before_row)  # copy to a list so a label can be appended
        try:
            after_row = after_map[before_row['instrument_id']]
        except KeyError:
            delete_count += 1
            result.append("deleted")
            change_list.append(result)

    for el in change_list:
        print(el)
    print("Insert: " + str(insert_count))
    print("Update: " + str(update_count))
    print("Delete: " + str(delete_count))

    dev_connect.close()
    prod_connect.close()


def main():
    transactions("...")


main()
instruments is the "before" table and instruments_after is the "after" table, so I want to see the changes that occur in going from instruments to instruments_after.
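The code above works well, but it fails when instruments or instruments_after is very large. I have a table with over 4 million rows, and simply trying to load it into memory causes Python to exit. I tried to get around this by using LIMIT, OFFSET in my queries to append to instruments_list in chunks, but Python still exits because two lists of this size simply take up too much space. My last option is to select a batch from one relation, iterate over batches of the second relation, and compare them, but that is very error-prone. Is there another way to get around this problem? I have considered allocating more memory to my VM, but I feel that the space complexity of my code is the real problem and should be fixed first.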
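For reference, here is a minimal sketch of the kind of batch-wise comparison I have in mind. It assumes both queries return rows ordered by instrument_id and that the rows arrive as streams rather than fully loaded lists, so only one row from each side is held in memory at a time. The function name diff_sorted and the sample rows are made up for illustration, the plain lists stand in for the two query results, and the sketch assumes the database sorts keys the same way Python compares them (true for integer keys, not necessarily for strings).

```python
def diff_sorted(before_rows, after_rows, key=lambda row: row[0]):
    """Yield (kind, before, after) by merging two key-ordered row streams."""
    before_iter, after_iter = iter(before_rows), iter(after_rows)
    before = next(before_iter, None)
    after = next(after_iter, None)
    while before is not None or after is not None:
        if after is None or (before is not None and key(before) < key(after)):
            # Key exists only in the "before" stream.
            yield ("deleted", before, None)
            before = next(before_iter, None)
        elif before is None or key(after) < key(before):
            # Key exists only in the "after" stream.
            yield ("inserted", None, after)
            after = next(after_iter, None)
        else:
            # Same key on both sides: report it only if the row changed.
            if before != after:
                yield ("updated", before, after)
            before = next(before_iter, None)
            after = next(after_iter, None)


# Hypothetical sample data, already sorted by the key in column 0.
before = [(1, "AAPL"), (2, "MSFT"), (4, "IBM")]
after = [(1, "AAPL"), (2, "MSFT Corp"), (3, "GOOG")]

changes = list(diff_sorted(before, after))
for change in changes:
    print(change)
```

The memory used by the merge itself does not grow with the table size, but it only works if both sides really are sorted by the same key, which is the part I am worried about getting wrong.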