Generating a list of times from a query with date ranges and multiple joins

Asked: 2014-01-21 15:53:58

Tags: mysql sql sqlalchemy

I am struggling to get my head around a complex SQL query.

Here is an sqlfiddle containing the tables and data: http://sqlfiddle.com/#!2/7de65

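It might make more sense if I explain what the tables do.

schedules is a list of train timetables. calling is the ordered list of calling points a train passes through on a schedule. An activation is created when it is confirmed that a train is going to run [1], and a movement is created each time the train passes a given calling point.

calling is linked to schedules through calling.sid. activations is linked to schedules through activations.sid. movement is linked to activations through movement.activation, and to calling through movement.calling_id.

Now the actual problem: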

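I want to generate a per-minute list of how many trains are active. A train is considered active if:

  • it has at least one movement associated with it (i.e. it has left its origin)
  • it has no movement associated with its final calling point
  • it was activated less than 24 hours ago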

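A train should be treated as active for as long as it meets all of these criteria, and so should be included in the count for each of those minutes.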
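The three criteria can be written down as a single predicate, which is handy for checking any SQL result against. This is only a sketch: the function name and its parameters are made up for illustration, the 24-hour rule is assumed to be measured relative to the minute being tested, and the inclusive boundaries are chosen to match the expected output further down (counted at 14:20 and at 15:04).

```python
from datetime import datetime, timedelta


def is_active(at, activated, movement_times, final_arrival=None):
    """Return True if a train counts as active at the minute `at`.

    at             -- the minute being tested
    activated      -- when the activation was created
    movement_times -- actual times of every movement recorded for the train
    final_arrival  -- actual time at the last calling point, or None if the
                      train has not reached it yet
    """
    # Criterion 1: at least one movement has been recorded by this minute.
    has_moved = any(t <= at for t in movement_times)
    # Criterion 2: no movement at the final calling point before this minute.
    not_finished = final_arrival is None or final_arrival >= at
    # Criterion 3: activated less than 24 hours before this minute (assumption).
    recent = at - activated < timedelta(hours=24)
    return has_moved and not_finished and recent


day = datetime(2014, 1, 1)
moves = [day.replace(hour=14, minute=20), day.replace(hour=15, minute=4)]
print(is_active(day.replace(hour=14, minute=30), day.replace(hour=12), moves, moves[-1]))
```

Any set-based query should agree with this predicate minute by minute.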

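Going by the data in the sqlfiddle above, the train leaves its first calling point at 14:20 and arrives at its last calling point at 15:04, so it should be included for every minute from 14:20 to 15:04. I wonder whether anyone can shed some light on how to do this. I don't consider myself an SQL expert (which is probably why I'm struggling; I don't actually consider myself even vaguely competent, but that is a different problem, or possibly the same one, I'm not sure).

I started going down this route: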

SELECT
    YEAR( activations.activated ),
    MONTH( activations.activated ),
    DAY( activations.activated ),
    HOUR( activations.activated ),
    MINUTE( activations.activated ),
    count(activations.id)
FROM activations, movement, calling, schedules 
WHERE activations.id = movement.activation AND movement.calling_id = calling.id AND schedules.id = activations.sid
GROUP BY DAYOFYEAR( activations.activated ) , HOUR( activations.activated ), MINUTE(activations.activated )

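But I know this is wrong, because a train is only listed once, no matter how long it was active.

I also considered simply looping in Python over every minute of the specified period. That sort of works, but it is super slow: getting the active trains at minute resolution for a 24-hour result takes 1440 queries, which is not exactly optimal. So I think there must be either some clever grouping or some kind of loop in SQL, but I don't know how to do either.

So if I ran the query for 14:18 to 15:07, I would get something like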
+-----------------+------------------+
| Timestamp       | Active services  |
+-----------------+------------------+
| 14:18 1/1/2014  | 0                |
| 14:19 1/1/2014  | 0                |
| 14:20 1/1/2014  | 1                |
| 14:21 1/1/2014  | 1                |
| 14:22 1/1/2014  | 1                |
[...
Identical record for every minute through to
    ...]
| 15:03 1/1/2014  | 1                |
| 15:04 1/1/2014  | 1                |
| 15:05 1/1/2014  | 0                |
| 15:06 1/1/2014  | 0                |
| 15:07 1/1/2014  | 0                |
+-----------------+------------------+

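(The format of the timestamp does not matter, as long as I can parse it afterwards.)

In my head I can see it working something like this (pseudocode):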

while time is between report_start_date and report_end_date:
    records = count(
        activations where number of movements(
            movement.actual < time
        ) > 0 //Number of movements created before current minute
            and
        movement.calling_id = calling_points(
            actual < minute
        ).last.id does not exist //As of this minute doesn't have a movement for last calling point
            and
        activations.activated > now - 24 hours //Was activated less than 24 hours ago
    )
    result timestamp, records
    time + 1 minute

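I have pretty much got the records = count() bit sorted; it is the looping or grouping by time that I am not sure about. I could group by the date of the first movement record, but then the train would only show up in the first minute. I want it to show up in every minute in which it is active.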


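Bonus points

I am actually trying to implement this in SQLAlchemy (hence the tag). I wanted to get my head around the basics in SQL before porting it to a SQLAlchemy query, but if you can do it in SQL and in SQLAlchemy/Python you will get something. I am not yet sure what it is, and it may be hypothetical.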


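  1. Before anyone who really knows this stuff criticises me: an activation does not confirm that the train will run, but it is close enough for my current purposes. My final query will exclude cancellations and the like, but I just want to get the basics down first.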

1 Answer:

Answer 0: (score: 0)

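To produce a result for every minute, I would not rely on every possible minute happening to exist as a value in some table in the database. For that reason I would create a "static" table in the database that stores nothing but those timestamps, and build the query starting from there. I did the following: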

CREATE TABLE "static_time" (
    "yyyymmddhhmm" datetime NOT NULL,
    PRIMARY KEY ("yyyymmddhhmm")
);

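Note: I used an sqlite database for all my tests, so you may need to change a few places to use the corresponding mysql constructs.

I also added all the data for 2 days for testing. You should probably do the same, from the time at which you want to run the first analysis up to some significant year in the future (e.g. 2050-12-31T23:59:00). I did this using sqlalchemy, but I am sure it would make sense to do it directly with some function or loop: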

class StaticTime(Base):
    __tablename__ = 'static_time'
    __table_args__ = ({'autoload': True, },)

# ...

def populate_static_time():
    print("Adding static times")
    sdt = datetime(2014, 1, 1)
    edt = sdt + timedelta(days=2)
    cdt = sdt
    while cdt <= edt:
        session.add(StaticTime(yyyymmddhhmm = cdt))
        cdt += timedelta(minutes=1)
    session.commit()
populate_static_time()
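The same table can also be filled without going through the ORM, which avoids creating one mapped object per minute. A minimal sketch, assuming sqlite (as used for the tests in this answer) and the static_time table defined earlier; fill_static_time and minute_range are illustrative names, not part of the original code, and timestamps are stored as 'YYYY-MM-DD HH:MM:SS' text:

```python
import sqlite3
from datetime import datetime, timedelta


def minute_range(start, end):
    """Yield every whole minute from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(minutes=1)


def fill_static_time(conn, start, end):
    """Bulk-insert one row per minute into static_time; return the row count."""
    rows = [(t.strftime("%Y-%m-%d %H:%M:%S"),) for t in minute_range(start, end)]
    conn.executemany("INSERT INTO static_time (yyyymmddhhmm) VALUES (?)", rows)
    conn.commit()
    return len(rows)


# In-memory database used here only so the sketch runs on its own.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE static_time (yyyymmddhhmm datetime NOT NULL PRIMARY KEY)")
inserted = fill_static_time(conn, datetime(2014, 1, 1), datetime(2014, 1, 3))
print(inserted)
```

On MySQL the same idea should carry over with a different driver and parameter placeholder style; that part is untested here.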

In addition, I assume your SA model includes relationships defined as follows:

# MODEL
class Schedule(Base):
    __tablename__ = 'schedules'
    __table_args__ = ({'autoload': True, },)


class Calling(Base):
    __tablename__ = 'calling'
    __table_args__ = ({'autoload': True, },)


class Activation(Base):
    __tablename__ = 'activations'
    __table_args__ = ({'autoload': True, },)

    # relationships:
    schedule = relationship("Schedule")


class Movement(Base):
    __tablename__ = 'movement'
    __table_args__ = ({'autoload': True, },)

    # relationships:
    # @note: use activation_rel as activation is column name
    activation_rel = relationship("Activation", backref="movements")

Now, let's build the query:

# 0. start with all times and proper counting (columns in SELECT)
q = session.query(
        StaticTime.yyyymmddhhmm.label("yyyymmddhhmm"),
        func.count(Activation.id.distinct()).label("count"),
    )

# 1. join on the trains which are active (or finished, which will be excluded later)
q = q.filter(Activation.movements.any(Movement.actual < StaticTime.yyyymmddhhmm))

# 2. join on the trains which are not finished (or no rows for those that did not)
# 2.a) subquery to get the "last" calling per sid
last_calling_sqry = (session.query(
        Calling.sid.label("sid"),
        func.max(Calling.id).label("max_calling_id"),
    )
    .group_by(Calling.sid)
).subquery("xxx")

# 2.b) subquery to find the movement for the "last" calling
train_done_at_sqry = (session.query(
        Activation.id.label("activation_id"),
        Movement.actual.label("arrived_time"),
    )
    .join(last_calling_sqry, Activation.sid == last_calling_sqry.c.sid)
    .join(Movement, and_(
            Movement.calling_id == last_calling_sqry.c.max_calling_id,
            Movement.activation == Activation.id,
        ))
).subquery("yyy")

# 2.c) lets use it now
q = q.outerjoin(train_done_at_sqry,
        train_done_at_sqry.c.activation_id == Activation.id,
    )
# 2.d) only those that arrived "after" currently tested time
q = q.filter(train_done_at_sqry.c.arrived_time >= StaticTime.yyyymmddhhmm)


# 3. add filter to use only those trains that started in last 24 hours
# @note: do not need this in case when step-X is used as well as it filters
# @TODO: replace func.date(...) with MYSQL version
q = q.filter(Activation.activated >= func.date("now", "-1 days"))

# 4. filter and group by
q = q.group_by(StaticTime.yyyymmddhhmm)
q = q.order_by(StaticTime.yyyymmddhhmm)

# @NOTE: at this point "q" will return only those minutes which have at least 1 active train

# X. FINALLY: WRAP AGAIN TO HAVE ALL MINUTES (also those with no active trains)
sub = q.subquery("sub")
w = session.query(
        StaticTime.yyyymmddhhmm.label("Timestamp"),
        func.ifnull(sub.c.count, 0).label("Active Services")
        )
w = w.outerjoin(sub, sub.c.yyyymmddhhmm == StaticTime.yyyymmddhhmm)
# @TODO: replace func.date(...) with MYSQL version
w = w.filter(Activation.activated >= func.date("now", "-1 days"))

for a in w:
    print(a)

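This is a fairly complex query, and with only the data you provided it is hard to test different scenarios. Hopefully you can compare it against your current results, and the code will give you some hints on how to get this done. I may also have joined on the wrong column in places (actual vs planned). Again, this may not work on mysql (I don't have it and don't know it well).

Bonus (inverted): the SQL statement generated for the w query on sqlite. You may find it easier to start from the raw SQL and move to sqlalchemy gradually: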

SELECT static_time.yyyymmddhhmm AS "Timestamp", ifnull(sub.count, ?) AS "Active Services"
FROM static_time 
LEFT OUTER JOIN (
    SELECT static_time.yyyymmddhhmm AS yyyymmddhhmm, count(DISTINCT activations.id) AS count
    FROM activations, static_time 
    LEFT OUTER JOIN (
        SELECT activations.id AS activation_id, movement.actual AS arrived_time
        FROM activations 
        JOIN (
            SELECT calling.sid AS sid, max(calling.id) AS max_calling_id
            FROM calling
            GROUP BY calling.sid
            ) AS xxx 
            ON activations.sid = xxx.sid 
        JOIN movement 
            ON movement.calling_id = xxx.max_calling_id AND movement.activation = activations.id
        ) AS yyy 
        ON yyy.activation_id = activations.id
    WHERE (EXISTS (SELECT 1
        FROM movement
        WHERE activations.id = movement.activation AND movement.actual < static_time.yyyymmddhhmm)
        )
    AND yyy.arrived_time >= static_time.yyyymmddhhmm 
    GROUP BY static_time.yyyymmddhhmm 
    ORDER BY static_time.yyyymmddhhmm
    ) AS sub 
        ON sub.yyyymmddhhmm = static_time.yyyymmddhhmm
WHERE static_time.yyyymmddhhmm >= ? AND static_time.yyyymmddhhmm <= ?

PARAMS: (0, '2014-01-01 14:15:00.000000', '2014-01-01 15:10:00.000000')
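For experimenting with the static-table idea in isolation, here is a small self-contained sketch using Python's sqlite3 module. Several things in it are assumptions rather than facts from the question: the schema is trimmed down to the columns named in the question, the sample rows only imitate the 14:20 to 15:04 train, the last calling point is taken to be the highest calling.id per sid (as in the query above), and the 24-hour rule is measured relative to each minute. It uses NOT EXISTS so that a train with no movement at its final calling point still counts.

```python
import sqlite3
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"

# Hypothetical, trimmed-down schema: only the columns named in the question.
SCHEMA = """
CREATE TABLE calling     (id INTEGER PRIMARY KEY, sid INTEGER);
CREATE TABLE activations (id INTEGER PRIMARY KEY, sid INTEGER, activated TEXT);
CREATE TABLE movement    (id INTEGER PRIMARY KEY, activation INTEGER,
                          calling_id INTEGER, actual TEXT);
CREATE TABLE static_time (yyyymmddhhmm TEXT PRIMARY KEY);
"""

# One output row per minute; minutes with no active train get a count of 0.
QUERY = """
SELECT st.yyyymmddhhmm AS ts, COUNT(DISTINCT a.id) AS active
FROM static_time AS st
LEFT JOIN activations AS a
  ON a.activated > datetime(st.yyyymmddhhmm, '-24 hours')
 AND EXISTS (SELECT 1 FROM movement AS m
             WHERE m.activation = a.id AND m.actual <= st.yyyymmddhhmm)
 AND NOT EXISTS (SELECT 1 FROM movement AS m
                 JOIN (SELECT sid, MAX(id) AS last_id
                       FROM calling GROUP BY sid) AS lc
                   ON lc.last_id = m.calling_id
                 WHERE m.activation = a.id AND lc.sid = a.sid
                   AND m.actual < st.yyyymmddhhmm)
WHERE st.yyyymmddhhmm BETWEEN ? AND ?
GROUP BY st.yyyymmddhhmm
ORDER BY st.yyyymmddhhmm
"""


def build_demo_db():
    """Create an in-memory database with one train and one day of minutes."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO calling (id, sid) VALUES (?, ?)",
                     [(1, 1), (2, 1), (3, 1)])
    conn.execute("INSERT INTO activations VALUES (1, 1, '2014-01-01 12:00:00')")
    conn.executemany(
        "INSERT INTO movement (activation, calling_id, actual) VALUES (?, ?, ?)",
        [(1, 1, "2014-01-01 14:20:00"),
         (1, 2, "2014-01-01 14:40:00"),
         (1, 3, "2014-01-01 15:04:00")])
    start = datetime(2014, 1, 1)
    minutes = [((start + timedelta(minutes=i)).strftime(FMT),)
               for i in range(24 * 60)]
    conn.executemany("INSERT INTO static_time VALUES (?)", minutes)
    conn.commit()
    return conn


def active_per_minute(conn, start, end):
    """Return (timestamp, active_count) for every minute from start to end."""
    return conn.execute(QUERY, (start, end)).fetchall()


conn = build_demo_db()
rows = active_per_minute(conn, "2014-01-01 14:18:00", "2014-01-01 15:07:00")
for ts, active in rows:
    print(ts, active)
```

Putting the activity conditions in the ON clause of the LEFT JOIN, instead of in WHERE, is what keeps the minutes with no active train in the result. The datetime(..., '-24 hours') call is sqlite-specific and would need a MySQL equivalent.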