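I am looking for an efficient way to reorganize and aggregate data by time range, using pandas or SQL directly.
For example, I have some data in a MySQL database that records the jobs on one node: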
job_id, time_start, time_end
1, 00:00, 04:00
2, 02:00, 05:00
3, 06:00, 07:00
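I want to apply some operation and get a table like this (the span from the earliest job start to the latest job end is split into ranges, and the count is the number of jobs active within each range):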
time_start, time_end, count of active jobs
00:00, 02:00, 1
02:00, 04:00, 2
04:00, 05:00, 1
06:00, 07:00, 1
Or a table like this (count is the number of active jobs, and time_duration is how long this node had count active jobs):
time_duration, count of active jobs
4hrs, 1
2hrs, 2
All I can think of is maintaining a dictionary variable and iterating over every row of the original table.
Answer 0 (score: 1)
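Assumption: you are using date/timestamp columns rather than plain strings; with strings in an inconsistent format, the comparisons may not behave correctly.
Required query: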
select t3.range_start, t3.range_end, count(*) as count_of_act_job
from your_table t1
cross join
(
    -- t3: consecutive ranges between adjacent time points
    select time_start as range_start, time_end as range_end
    from
    (
        -- rows are visited latest-first, so @next holds the following time point
        select t2.*, @next as time_end, @next := time_start
        from
        (
            select time_start from your_table
            union
            select time_end as time_start from your_table
        ) t2,
        (select @next := null) var_init
        order by time_start desc
    ) sq
    where time_start <> time_end
) t3
where t3.range_start >= t1.time_start and t3.range_end <= t1.time_end
group by t3.range_start, t3.range_end
order by range_start
Output:
+-------------+-----------+------------------+
| range_start | range_end | count_of_act_job |
+-------------+-----------+------------------+
| 00:00 | 02:00 | 1 |
| 02:00 | 04:00 | 2 |
| 04:00 | 05:00 | 1 |
| 06:00 | 07:00 | 1 |
+-------------+-----------+------------------+
Step 1. The t2 query: a union that collects every distinct time point, whether it is a start or an end.
select time_start from your_table
union
select time_end as time_start from your_table;
Output:
+------------+
| time_start |
+------------+
| 00:00 |
| 02:00 |
| 06:00 |
| 04:00 |
| 05:00 |
| 07:00 |
+------------+
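For readers coming from the pandas/Python side of the question, Step 1 can be sketched in plain Python. This is only an illustration: the jobs list and the boundaries helper are names made up here, and it assumes zero-padded HH:MM strings, which happen to sort correctly as text (real data should use proper time types, as noted in the assumption above).

```python
# Sample rows from the question: (job_id, time_start, time_end).
# Assumption: zero-padded "HH:MM" strings, so text order equals time order.
jobs = [
    (1, "00:00", "04:00"),
    (2, "02:00", "05:00"),
    (3, "06:00", "07:00"),
]


def boundaries(rows):
    """Return every distinct start or end time, sorted ascending.

    Mirrors the UNION in the t2 query: a set removes duplicates
    just as UNION (without ALL) does.
    """
    points = set()
    for _job_id, start, end in rows:
        points.add(start)
        points.add(end)
    return sorted(points)


if __name__ == "__main__":
    print(boundaries(jobs))
```

Unlike the SQL union, this sketch sorts the result immediately, which makes the pairing in the next step simpler.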
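Step 2. The t3 query: use a user variable to emulate the lead function. It pairs each time point with the next one, which yields every consecutive range.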
select time_start as range_start, time_end as range_end
from
(
    -- rows are visited latest-first, so @next holds the following time point
    select t2.*, @next as time_end, @next := time_start
    from
    (
        select time_start from your_table
        union
        select time_end as time_start from your_table
    ) t2,
    (select @next := null) var_init
    order by time_start desc
) sq
where time_start <> time_end;
Output:
+-------------+-----------+
| range_start | range_end |
+-------------+-----------+
| 06:00 | 07:00 |
| 05:00 | 06:00 |
| 04:00 | 05:00 |
| 02:00 | 04:00 |
| 00:00 | 02:00 |
+-------------+-----------+
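The pairing that the user variable performs here (each time point matched with the one that follows it) can be sketched in plain Python by zipping the sorted list with itself shifted by one. The function name consecutive_ranges is hypothetical, and the sketch again assumes zero-padded HH:MM strings.

```python
def consecutive_ranges(points):
    """Pair each time point with the next one, like a lead() over sorted rows.

    Duplicates are dropped first, so no range has equal start and end
    (the SQL version filters those with time_start <> time_end).
    """
    ordered = sorted(set(points))
    return list(zip(ordered, ordered[1:]))


if __name__ == "__main__":
    time_points = ["00:00", "02:00", "06:00", "04:00", "05:00", "07:00"]
    for range_start, range_end in consecutive_ranges(time_points):
        print(range_start, range_end)
```

This produces the same five ranges as the table above, in ascending rather than descending order. The 05:00 to 06:00 gap is still present at this stage; it only disappears once ranges are matched against jobs.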
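Step 3. Cross join your_table with t3 to get one row per (job, range) pair. The where clause keeps only the ranges that lie entirely within a job's start and end times.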
select t3.range_start, t3.range_end
from your_table t1
cross join
(
    select time_start as range_start, time_end as range_end
    from
    (
        select t2.*, @next as time_end, @next := time_start
        from
        (
            select time_start from your_table
            union
            select time_end as time_start from your_table
        ) t2,
        (select @next := null) var_init
        order by time_start desc
    ) sq
    where time_start <> time_end
) t3
where t3.range_start >= t1.time_start and t3.range_end <= t1.time_end;
Output:
+-------------+-----------+
| range_start | range_end |
+-------------+-----------+
| 06:00 | 07:00 |
| 04:00 | 05:00 |
| 02:00 | 04:00 |
| 02:00 | 04:00 |
| 00:00 | 02:00 |
+-------------+-----------+
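Step 4. Group by range_start and range_end and apply count(*) to get the desired result. Ranges in which no job is active (05:00 to 06:00 here) match no row of your_table, so they drop out.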
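Putting the four steps together outside the database, a minimal plain-Python sketch might look like the following. It is not the query above translated line by line: the names active_job_ranges, duration_by_count and to_minutes are made up for illustration, durations are reported in minutes, and the input is assumed to be zero-padded HH:MM strings within a single day. The second function also produces the other table the question asks for (total duration per active-job count), which the SQL does not cover.

```python
from collections import defaultdict

# Sample rows from the question: (job_id, time_start, time_end).
jobs = [
    (1, "00:00", "04:00"),
    (2, "02:00", "05:00"),
    (3, "06:00", "07:00"),
]


def to_minutes(hhmm):
    """Convert a zero-padded "HH:MM" string to minutes since midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def active_job_ranges(rows):
    """Return (range_start, range_end, count) for each range with active jobs."""
    # Steps 1 and 2: distinct time points, paired with their successors.
    points = sorted({t for _id, start, end in rows for t in (start, end)})
    result = []
    for range_start, range_end in zip(points, points[1:]):
        # Steps 3 and 4: count the jobs that fully cover this range.
        count = sum(
            1 for _id, start, end in rows
            if start <= range_start and range_end <= end
        )
        if count:  # ranges with no active job are dropped, as in the SQL
            result.append((range_start, range_end, count))
    return result


def duration_by_count(ranges):
    """Sum the length of the ranges, in minutes, per active-job count."""
    totals = defaultdict(int)
    for range_start, range_end, count in ranges:
        totals[count] += to_minutes(range_end) - to_minutes(range_start)
    return dict(totals)


if __name__ == "__main__":
    ranges = active_job_ranges(jobs)
    for row in ranges:
        print(row)
    print(duration_by_count(ranges))
```

Like the cross join, this compares every range against every job, so the work grows with the product of the two. That is fine for small tables; for large ones a single pass over sorted start/end events would likely scale better.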