I've hit a performance bottleneck while trying to insert/update many records across two tables at once. Currently, I have a couple of Pandas DataFrames (new_records_df and modified_records_df) containing the records I want to insert/update. See the following pseudocode:
if not new_records_df.empty:
    # .T.to_dict().values() turns the DataFrame into a list of per-row dicts
    new_recs_data = new_records_df.T.to_dict().values()
    new_recs = []
    for r in new_recs_data:
        new_rec = {'foo_id': foo_id,   # placeholder values pulled from each row
                   'bar': bar}
        new_recs.append(new_rec)
    # return_defaults=True writes the id of each inserted record back into its dict
    db_session.bulk_insert_mappings(Record, new_recs, return_defaults=True)

    new_related_recs = []
    for nr in new_recs:
        new_related_rec = {'rec_id': nr['id'],
                           'baz': baz}
        new_related_recs.append(new_related_rec)
    db_session.bulk_insert_mappings(RelatedRec, new_related_recs)

if not modified_records_df.empty:
    # again, converting the DataFrame to a list of dicts
    modified_rec_data = modified_records_df.T.to_dict().values()
    modified_recs = []
    for m in modified_rec_data:
        modified_rec = {'id': m['id'],
                        'zab': zab}
        modified_recs.append(modified_rec)
    # when a record is modified, only the RelatedRec row is updated;
    # the Record row already exists and stays unmodified
    db_session.bulk_update_mappings(RelatedRec, modified_recs)
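
For what it's worth, the conversion step can be timed in isolation. Below is a minimal, self-contained micro-benchmark on synthetic data; the DataFrame here is just a stand-in, and the absolute numbers will vary by machine and pandas version:

import time

import numpy as np
import pandas as pd

# Synthetic stand-in for one of the DataFrames: 8k rows, mixed dtypes.
df = pd.DataFrame({'id': np.arange(8000),
                   'zab': np.random.rand(8000),
                   'label': ['rec'] * 8000})

t0 = time.perf_counter()
slow = list(df.T.to_dict().values())    # transposing a mixed-dtype frame coerces it to object dtype
t1 = time.perf_counter()
fast = df.to_dict('records')            # builds the row dicts directly, no transpose
t2 = time.perf_counter()

print('.T.to_dict().values() : %.3fs' % (t1 - t0))
print("to_dict('records')    : %.3fs" % (t2 - t1))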
The problem is that for ~8k records, the loops over the dictionaries take about 20 seconds, while the actual database inserts/updates only take about 4 seconds. I'm hoping there's a clever way to eliminate the for loops, since they seem to be the bottleneck. My database is Postgres and my driver is psycopg2 2.6.2.
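
For illustration, a loop-free variant of the same mapping construction might look like the sketch below. It reuses the names from the pseudocode above; the column names foo_id, bar, baz, id, and zab are placeholders rather than a confirmed schema, and it assumes baz is a column of new_records_df:

# Hypothetical sketch; column names are assumed from the placeholders above.
if not new_records_df.empty:
    # Select only the needed columns, then emit row dicts directly.
    new_recs = new_records_df[['foo_id', 'bar']].to_dict('records')
    db_session.bulk_insert_mappings(Record, new_recs, return_defaults=True)

    # return_defaults=True has filled in nr['id']; pair each id with its
    # related payload (assumes 'baz' lives in the same frame).
    new_related_recs = [{'rec_id': nr['id'], 'baz': baz_val}
                        for nr, baz_val in zip(new_recs, new_records_df['baz'])]
    db_session.bulk_insert_mappings(RelatedRec, new_related_recs)

if not modified_records_df.empty:
    modified_recs = modified_records_df[['id', 'zab']].to_dict('records')
    db_session.bulk_update_mappings(RelatedRec, modified_recs)

If a DataFrame already holds exactly the columns the mapping needs, the column selection can be dropped and a bare to_dict('records') would do.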