Speed up Python w/ SQLAlchemy function

Time: 2017-07-28 21:50:36

Tags: python performance sqlalchemy

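I have a function that populates a database table using Python and SQLAlchemy. It currently runs quite slowly, taking about 17 minutes. I think the main problem is that I loop over two large sets of data to build the new table. I have included the record counts in the code below.

How can I speed this up? Should I try to convert the nested for loops into one big SQLAlchemy query? I profiled the function with PyCharm, but I am not sure I fully understand the results.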

def populate(self):
    """Core function to populate positions."""

    # get raw annotations with tag Org
    # returns 11,659 records
    organizations = model.session.query(model.Annotation) \
        .filter(model.Annotation.tag == 'Org')\
        .filter(model.Annotation.organization_id.isnot(None)).all()

    # get raw annotations with tags Support or Oppose
    # returns 2,947 records
    annotations = model.session.query(model.Annotation) \
        .filter((model.Annotation.tag == 'Support') | (model.Annotation.tag == 'Oppose')).all()

    for org in organizations:
        for anno in annotations:

            # Org overlaps with Support or Oppose tag
            # start and end columns are integers
            if org.start >= anno.start and org.end <= anno.end:
                position = model.Position()
                # set to de-duplicated organization
                position.organization_id = org.organization_id
                position.disposition = anno.tag
                # look up bill_id from document_bill table
                document = model.session.query(model.document_bill)\
                    .filter_by(document_id=anno.document_id).first()
                position.bill_id = document.bill_id
                position.document_id = anno.document_id
                model.session.add(position)
                logging.info('org: {}, disposition: {}, bill: {}'.format(
                    position.organization_id, position.disposition, position.bill_id)
                )
                continue
        logging.info('committing to database')
        model.session.commit()

1 answer:

Answer 0 (score: 0)

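My bets, in decreasing order of probability:

  • Autocommit is on, so you are waiting on the disk.
  • The query inside the loop, "document = model.session.query(model.document_bill)....", is slow (check it with EXPLAIN ANALYZE).
  • Most of the time is actually spent printing log output to the terminal in the inner loop (you should profile).
  • model.session.add(position) is slow (I don't know what it does).
  • (This one should really be first.) Could an SQL query such as INSERT INTO ... SELECT do this in a few tens of milliseconds? If so, why loop in the application?...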
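
To make the INSERT INTO ... SELECT idea concrete, here is a minimal sketch using the standard-library sqlite3 module so that it runs on its own. The table and column names (annotation, document_bill, position) are assumptions inferred from the attributes used in the question, not the real schema, and build_demo_db with its sample rows is purely illustrative. The self-join mirrors the question's overlap test, which does not compare document_id between the two annotations. Unlike the loop's .first() lookup, the join would emit one row per matching document_bill row and would skip annotations that have none.

```python
import sqlite3

# Hypothetical schema: names mirror the attributes used in the question,
# but the real tables may look different.
SCHEMA = """
CREATE TABLE annotation (
    id INTEGER PRIMARY KEY,
    document_id INTEGER,
    tag TEXT,
    start INTEGER,
    "end" INTEGER,
    organization_id INTEGER
);
CREATE TABLE document_bill (document_id INTEGER, bill_id INTEGER);
CREATE TABLE position (
    id INTEGER PRIMARY KEY,
    organization_id INTEGER,
    disposition TEXT,
    bill_id INTEGER,
    document_id INTEGER
);
"""

# One set-based statement in place of the nested Python loops.
POPULATE = """
INSERT INTO position (organization_id, disposition, bill_id, document_id)
SELECT org.organization_id, anno.tag, db.bill_id, anno.document_id
FROM annotation AS org
JOIN annotation AS anno
  ON org.start >= anno.start AND org."end" <= anno."end"
JOIN document_bill AS db
  ON db.document_id = anno.document_id
WHERE org.tag = 'Org'
  AND org.organization_id IS NOT NULL
  AND anno.tag IN ('Support', 'Oppose')
"""


def populate(conn):
    """Fill the position table with one statement and one commit."""
    with conn:  # commits on success, rolls back on error
        conn.execute(POPULATE)


def build_demo_db():
    """Create an in-memory database with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO annotation VALUES (?, ?, ?, ?, ?, ?)",
        [
            (1, 10, "Org", 5, 10, 100),         # inside the Support span
            (2, 10, "Support", 0, 20, None),
            (3, 10, "Org", 50, 60, 101),        # inside no span
            (4, 10, "Org", 6, 8, None),         # no organization_id
            (5, 11, "Oppose", 100, 200, None),
            (6, 11, "Org", 120, 130, 102),      # inside the Oppose span
        ],
    )
    conn.executemany(
        "INSERT INTO document_bill VALUES (?, ?)", [(10, 7), (11, 8)]
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    conn = build_demo_db()
    populate(conn)
    for row in conn.execute(
        "SELECT organization_id, disposition, bill_id, document_id "
        "FROM position ORDER BY organization_id"
    ):
        print(row)
```

If the set-based approach fits, the same SQL could presumably be sent through the existing SQLAlchemy session rather than a separate connection. Running it with EXPLAIN first would show whether the range join needs an index on start and end.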