Airflow ORM query optimization

Time: 2019-05-22 13:40:28

Tags: python mysql sqlalchemy airflow

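I am using Airflow to schedule jobs, but it has become slower than it used to be, especially the task_stats method in views.py. I have more than 400 DAGs, and the task_instance table has about 3 million rows. I have to wait more than 40 seconds for a response from task_stats. Is there any way to optimize this method?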

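Of the two queries passed to union_all() in task_stats, RunningTI is the slowest. If I remove RunningTI and keep only LastTI in the union, I get a response within 5 seconds, but RunningTI is needed for the front end to display details.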

Can this query be optimized? The database is MySQL.
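One way to confirm which half of the union is slow is to run each branch on its own through the same GROUP BY and time it. The sketch below is only an illustration of that measurement, under loud assumptions: it uses Python's built-in sqlite3 with a toy in-memory schema instead of MySQL, the SQL strings are a hand-written approximation of what the ORM queries produce, the join against the dag table (is_active, is_subdag) is left out, and the helper names build_toy_db and count_by_state are hypothetical.

```python
import sqlite3
import time


def build_toy_db():
    """Create an in-memory database with toy dag_run and task_instance tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE dag_run (dag_id TEXT, execution_date TEXT, state TEXT);
        CREATE TABLE task_instance (
            task_id TEXT, dag_id TEXT, execution_date TEXT, state TEXT);
    """)
    conn.executemany("INSERT INTO dag_run VALUES (?, ?, ?)", [
        ("dag_a", "2019-05-20", "success"),
        ("dag_a", "2019-05-21", "success"),
        ("dag_a", "2019-05-22", "running"),
        ("dag_b", "2019-05-21", "failed"),
    ])
    conn.executemany("INSERT INTO task_instance VALUES (?, ?, ?, ?)", [
        ("t1", "dag_a", "2019-05-20", "success"),
        ("t1", "dag_a", "2019-05-21", "success"),
        ("t2", "dag_a", "2019-05-21", "success"),
        ("t1", "dag_a", "2019-05-22", "running"),
        ("t2", "dag_a", "2019-05-22", "queued"),
        ("t1", "dag_b", "2019-05-21", "failed"),
    ])
    return conn


# Approximation of the LastTI branch: task instances of the most recent
# non-running dag_run of each DAG.
LAST_TI_SQL = """
    SELECT ti.dag_id, ti.state
    FROM task_instance ti
    JOIN (SELECT dag_id, MAX(execution_date) AS execution_date
          FROM dag_run WHERE state != 'running' GROUP BY dag_id) last_dag_run
      ON last_dag_run.dag_id = ti.dag_id
     AND last_dag_run.execution_date = ti.execution_date
"""

# Approximation of the RunningTI branch: task instances of running dag_runs.
RUNNING_TI_SQL = """
    SELECT ti.dag_id, ti.state
    FROM task_instance ti
    JOIN (SELECT dag_id, execution_date
          FROM dag_run WHERE state = 'running') running_dag_run
      ON running_dag_run.dag_id = ti.dag_id
     AND running_dag_run.execution_date = ti.execution_date
"""


def count_by_state(conn, branch_sql):
    """Group one branch by (dag_id, state); return (rows, elapsed seconds)."""
    sql = ("SELECT dag_id, state, COUNT(*) FROM (" + branch_sql + ") AS union_ti "
           "GROUP BY dag_id, state ORDER BY dag_id, state")
    start = time.perf_counter()
    rows = conn.execute(sql).fetchall()
    return rows, time.perf_counter() - start


conn = build_toy_db()
last_rows, last_secs = count_by_state(conn, LAST_TI_SQL)
running_rows, running_secs = count_by_state(conn, RUNNING_TI_SQL)
print("LastTI   ", last_rows, "%.6fs" % last_secs)
print("RunningTI", running_rows, "%.6fs" % running_secs)
```

Timings on a toy SQLite database say nothing about MySQL with 3 million rows; the point is only the shape of the measurement. Against the real database, the same two statements could be prefixed with EXPLAIN to see which indexes each branch uses.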

The task_stats method:

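# Imports the excerpt relies on; expose, login_required and provide_session
# are assumed to be available as in the surrounding airflow/www/views.py.
import sqlalchemy as sqla
from sqlalchemy import and_, union_all

from airflow import models
from airflow.utils.state import State


@expose('/task_stats')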
@login_required
@provide_session
def task_stats(self, session=None):
    TI = models.TaskInstance
    DagRun = models.DagRun
    Dag = models.DagModel

    LastDagRun = (
        session.query(DagRun.dag_id, sqla.func.max(DagRun.execution_date).label('execution_date'))
            .join(Dag, Dag.dag_id == DagRun.dag_id)
            .filter(DagRun.state != State.RUNNING)
            .filter(Dag.is_active == True)  # noqa: E712
            .filter(Dag.is_subdag == False)  # noqa: E712
            .group_by(DagRun.dag_id)
            .subquery('last_dag_run')
    )
    RunningDagRun = (
        session.query(DagRun.dag_id, DagRun.execution_date)
            .join(Dag, Dag.dag_id == DagRun.dag_id)
            .filter(DagRun.state == State.RUNNING)
            .filter(Dag.is_active == True)  # noqa: E712
            .filter(Dag.is_subdag == False)  # noqa: E712
            .subquery('running_dag_run')
    )

    # Select all task_instances from active dag_runs.
    # If no dag_run is active, return task instances from most recent dag_run.
    LastTI = (
        session.query(TI.dag_id.label('dag_id'), TI.state.label('state'))
        .join(LastDagRun, and_(
            LastDagRun.c.dag_id == TI.dag_id,
            LastDagRun.c.execution_date == TI.execution_date))
    )
    RunningTI = (
        session.query(TI.dag_id.label('dag_id'), TI.state.label('state'))
        .join(RunningDagRun, and_(
            RunningDagRun.c.dag_id == TI.dag_id,
            RunningDagRun.c.execution_date == TI.execution_date))
    )

    UnionTI = union_all(LastTI, RunningTI).alias('union_ti')
    # If I remove RunningTI from union_all() and change the line above to
    # UnionTI = union_all(LastTI).alias('union_ti'), the query is much faster.

    qry = (
        session.query(UnionTI.c.dag_id, UnionTI.c.state, sqla.func.count())
        .group_by(UnionTI.c.dag_id, UnionTI.c.state)
    )

    data = {}
    for dag_id, state, count in qry:
        if dag_id not in data:
            data[dag_id] = {}
        data[dag_id][state] = count
    session.commit()
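For readers unfamiliar with the final loop, the following minimal sketch shows the shape of the dictionary it builds. The rows are hypothetical sample values standing in for the result of the grouped query, and fold_counts is an illustrative helper name, not part of Airflow.

```python
def fold_counts(rows):
    """Fold (dag_id, state, count) rows into {dag_id: {state: count}}."""
    data = {}
    for dag_id, state, count in rows:
        # Create the inner dict on first sight of a dag_id, then store the count.
        data.setdefault(dag_id, {})[state] = count
    return data


# Hypothetical rows, shaped like the output of the GROUP BY dag_id, state query.
sample_rows = [
    ("dag_a", "success", 2),
    ("dag_a", "running", 1),
    ("dag_b", "failed", 1),
]
data = fold_counts(sample_rows)
print(data)
```

Each DAG ends up with one entry per task state, which is what the front end reads to draw the per-state counters.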

Link to GitHub: https://github.com/apache/airflow/blob/master/airflow/www/views.py

0 answers:

No answers yet.