Question

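I'm looking for an efficient way to reorganize and aggregate data by time range, using either pandas or SQL directly.

For example, I have some data in a MySQL database that records the jobs run on one node: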

job_id, time_start, time_end
1,      00:00,      04:00
2,      02:00,      05:00
3,      06:00,      07:00

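I'd like to apply some operation and get a table like this (the time between the earliest job start and the latest job end is split into ranges, and the count is the number of jobs active within each range):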

time_start, time_end, count of active jobs
00:00,      02:00,    1
02:00,      04:00,    2
04:00,      05:00,    1
06:00,      07:00,    1

Or a table like this (count is the number of active jobs, and time_duration is how long this node had that many active jobs):

time_duration, count of active jobs
4hrs,           1
2hrs,           2

All I can think of is maintaining a dictionary variable and iterating over every row of the original table.

Answer 1

SQL Fiddle Demo

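Assumption: you are using a date/timestamp column rather than plain strings. If the values are not stored in a proper format, the query may behave incorrectly.

Query: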
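To illustrate why the column type matters, here is a minimal Python sketch with hypothetical values that are not part of the original data. Text is compared character by character, so unpadded time strings sort in the wrong order, while real time values do not.

```python
from datetime import time

# Hypothetical values: the same two times stored as text and as time objects.
as_text = ["9:00", "10:00"]
as_time = [time(9, 0), time(10, 0)]

# Text comparison looks at characters, so "10:00" sorts before "9:00" ("1" < "9").
sorted_text = sorted(as_text)

# Proper time values sort chronologically.
sorted_time = sorted(as_time)

print(sorted_text)
print(sorted_time)
```

Zero-padded values such as 09:00 happen to sort correctly as text, which is why the sample data in the question still works.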

select t3.range_start, t3.range_end,count(*) as count_of_act_job
from your_table t1
cross join
(
    select time_start as range_start ,time_end as range_end from ( select
    t2.*, @next as time_end , @next := time_start
    from
    (   select time_start from your_table
        union 
        select time_end as time_start from your_table
    ) t2
    , (select @next := null) var_init
    order by time_start desc
    ) sq
    where time_start<>time_end
) t3
where  t3.range_start>=t1.time_start and t3.range_end<=t1.time_end
group by t3.range_start,t3.range_end
order by range_start

Output:

+-------------+-----------+------------------+
| range_start | range_end | count_of_act_job |
+-------------+-----------+------------------+
| 00:00       | 02:00     |                1 |
| 02:00       | 04:00     |                2 |
| 04:00       | 05:00     |                1 |
| 06:00       | 07:00     |                1 |
+-------------+-----------+------------------+

Explanation:

Step 1: the t2 query. A UNION collects every distinct start and end time.

select time_start from your_table
union 
select time_end as time_start from your_table;

Output:

+------------+
| time_start |
+------------+
| 00:00      |
| 02:00      |
| 06:00      |
| 04:00      |
| 05:00      |
| 07:00      |
+------------+
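For readers following along in Python, Step 1 can be sketched roughly as below. This is only an illustration on the question's sample rows, not part of the SQL solution; the names `jobs` and `boundaries` are made up for the example.

```python
# Sample rows from the question as (job_id, time_start, time_end).
# Zero-padded HH:MM strings sort chronologically, which this sketch relies on.
jobs = [
    (1, "00:00", "04:00"),
    (2, "02:00", "05:00"),
    (3, "06:00", "07:00"),
]

# Like UNION (as opposed to UNION ALL), a set keeps each boundary only once.
boundaries = sorted({t for _, start, end in jobs for t in (start, end)})

print(boundaries)
```

Unlike the SQL output above, the list is sorted here so that the next step can pair neighbouring values directly.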

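Step 2: the t3 query. A user variable emulates the LEAD window function (window functions are only available from MySQL 8.0 onward), pairing each time with the next one to produce every consecutive range.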

select time_start as range_start ,time_end as range_end from ( select
t2.*, @next as time_end , @next := time_start
from
(   select time_start from your_table
    union 
    select time_end as time_start from your_table
) t2
, (select @next := null) var_init
order by time_start desc
) sq
where time_start<>time_end;

Output:

+-------------+-----------+
| range_start | range_end |
+-------------+-----------+
| 06:00       | 07:00     |
| 05:00       | 06:00     |
| 04:00       | 05:00     |
| 02:00       | 04:00     |
| 00:00       | 02:00     |
+-------------+-----------+
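The same pairing can be sketched in Python by zipping the sorted boundaries with themselves shifted by one position, which plays the role of LEAD. The `boundaries` list is hard-coded here from the Step 1 output so the snippet runs on its own.

```python
# Distinct boundaries from Step 1, in ascending order.
boundaries = ["00:00", "02:00", "04:00", "05:00", "06:00", "07:00"]

# Pair each boundary with its successor; the last boundary has no successor,
# so zip drops it, much like the SQL filter discards the row without a next value.
ranges = list(zip(boundaries, boundaries[1:]))

for range_start, range_end in ranges:
    print(range_start, range_end)
```

Note that the gap 05:00 to 06:00 is still present at this stage; it is removed in the next step because no job covers it.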

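Step 3: cross join your_table with t3 to pair every job with every range. The WHERE clause keeps only the ranges that lie entirely within a job's start and end time, so a range appears once per job that is active during it.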

select t3.range_start, t3.range_end
from your_table t1
cross join
(
    select time_start as range_start ,time_end as range_end from ( select
    t2.*, @next as time_end , @next := time_start
    from
    (   select time_start from your_table
        union 
        select time_end as time_start from your_table
    ) t2
    , (select @next := null) var_init
    order by time_start desc
    ) sq
    where time_start<>time_end
) t3
where  t3.range_start>=t1.time_start and t3.range_end<=t1.time_end;

Output:

+-------------+-----------+
| range_start | range_end |
+-------------+-----------+
| 06:00       | 07:00     |
| 04:00       | 05:00     |
| 02:00       | 04:00     |
| 02:00       | 04:00     |
| 00:00       | 02:00     |
+-------------+-----------+
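A rough Python equivalent of the cross join and filter, again on the question's sample data and with illustrative names only, could look like this:

```python
from itertools import product

# Sample jobs from the question and the consecutive ranges from Step 2.
jobs = [(1, "00:00", "04:00"), (2, "02:00", "05:00"), (3, "06:00", "07:00")]
ranges = [
    ("00:00", "02:00"),
    ("02:00", "04:00"),
    ("04:00", "05:00"),
    ("05:00", "06:00"),
    ("06:00", "07:00"),
]

# product() is the cross join; the condition mirrors the SQL WHERE clause
# and keeps a range only when it lies entirely inside the job's interval.
matches = [
    (r_start, r_end)
    for (_, j_start, j_end), (r_start, r_end) in product(jobs, ranges)
    if r_start >= j_start and r_end <= j_end
]

print(matches)
```

The range 02:00 to 04:00 appears twice because both job 1 and job 2 cover it, and 05:00 to 06:00 disappears because no job does.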

Step 4: group by the range and apply COUNT(*) to get the desired result.
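The grouping step, plus the second table the question asks for (total duration per count of active jobs), can be sketched in Python as follows. The query above does not produce the second table; that part is an assumed extension, and the helper `minutes` is hypothetical.

```python
from collections import Counter

# Rows produced by Step 3: one (range_start, range_end) per active job.
matches = [
    ("00:00", "02:00"),
    ("02:00", "04:00"),
    ("02:00", "04:00"),
    ("04:00", "05:00"),
    ("06:00", "07:00"),
]

# GROUP BY range with COUNT(*): number of active jobs per range.
counts = Counter(matches)


def minutes(hhmm):
    """Convert an HH:MM string to minutes since midnight (hypothetical helper)."""
    hours, mins = hhmm.split(":")
    return int(hours) * 60 + int(mins)


# Second table: total minutes during which exactly n jobs were active.
duration_by_count = Counter()
for (start, end), n in counts.items():
    duration_by_count[n] += minutes(end) - minutes(start)

for (start, end), n in sorted(counts.items()):
    print(start, end, n)
for n, total in sorted(duration_by_count.items()):
    print(f"{total // 60}hrs", n)
```

On the sample data this gives 4 hours with one active job and 2 hours with two, matching the second table in the question.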
