Question

I have a big problem: transforming a timestamped status table into a form that can be queried quickly.

I basically have a table like this:

╔══════════╦═══════════╦══════════╦══════════╦═══════════╦══════════╗
║ PersonID ║ Firstname ║ Lastname ║ status   ║ startdate ║ endate   ║
║ 10233    ║ stacy     ║ adamns   ║ active   ║ 12-23-13  ║ 02-11-14 ║
║ 10233    ║ stacy     ║ adamns   ║ pending  ║ 02-11-14  ║ 03-09-14 ║
║ 10233    ║ stacy     ║ adamns   ║ inactive ║ 03-09-14  ║ 12-31-99 ║
║ 10244    ║ steve     ║ smith    ║ active   ║ 01-07-14  ║ 12-31-99 ║
╚══════════╩═══════════╩══════════╩══════════╩═══════════╩══════════╝

and I want to turn it into:

╔══════════╦══════════╦═══════════╦══════════╦════════╗
║ Date     ║ PersonID ║ Firstname ║ Lastname ║ status ║
║ 12-23-13 ║ 10233    ║ stacy     ║ adamns   ║ active ║
║ 12-24-13 ║ 10233    ║ stacy     ║ adamns   ║ active ║
║ 12-25-13 ║ 10233    ║ stacy     ║ adamns   ║ active ║
║ 12-26-13 ║ 10233    ║ stacy     ║ adamns   ║ active ║
║          ║          ║           ║          ║        ║
╚══════════╩══════════╩═══════════╩══════════╩════════╝

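The table has 28 additional columns describing various attributes of the employee (they are static and never change, e.g. height), and it is 48 million rows long...

I need to know how many employees were in the "active" state on each day for the past two years.

With a smaller date range or data set this would be simple; I would just join to a calendar table, something like this: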

Create Table People_history as
    Select Day_id,Firstname,Lastname,status
    from People
    Join Time_calendar on day_id between startdate and endate;

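I have calculated that the resulting table would be 7.8 billion rows and over 3 terabytes, but my database cannot even finish the query because it runs out of temporary memory. Using a cursor I can get around the memory problem, but it takes over 24 hours to run... I only need to do this once, so maybe that is what I will end up doing, but I thought I would ask you first.

Should I be looking at other databases for this kind of analysis, or just at a more efficient approach?

I looked at Cassandra, which suggests creating columns for the time intervals, and at MongoDB, where you could put the intervals and statuses into each person's own hash. Would these be good options?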

Answer 1

The answers on the Oracle forum here may help.

With the help of those answers, I came up with the following:

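-- Caveat: CONNECT BY LEVEL without PRIOR conditions on a multi-row table
-- generates many duplicate rows that DISTINCT must then remove, which can be
-- very slow on a large table.
WITH date_ranges AS
         (    SELECT DISTINCT personid,
                              firstname,
                              lastname,
                              startdate + LEVEL - 1 AS date_i
                FROM myTable
          CONNECT BY LEVEL <= CEIL (endate - startdate) + 1)
  SELECT dr.date_i,
         dr.personid,
         dr.firstname,
         dr.lastname,
         -- Caveat: BETWEEN is inclusive on both ends, so a date on which one row
         -- ends and the next begins may match two rows and raise ORA-01427.
         (SELECT mt.status
            FROM myTable mt
           WHERE     mt.personid = dr.personid
                 AND dr.date_i BETWEEN mt.startdate AND mt.endate)
             AS status
    FROM date_ranges dr;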

Please make the necessary changes and use it accordingly.
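For anyone who wants to see what the densification does outside of SQL, here is a minimal Python sketch of the same idea: each status interval is expanded into one row per day. It is only an illustration under my own assumptions, not the query above: the function name `expand_to_days`, the tuple layout, and the `window_end` cap are all made up for the example, and I treat the end date as exclusive so that a day on which one status ends and the next begins is emitted only once.

```python
from datetime import date, timedelta

def expand_to_days(rows, window_end):
    """Yield (day, person_id, status), one tuple per day a status was held.

    Assumption: intervals are half-open, [start, end), and anything past
    window_end is ignored so an open-ended sentinel date cannot run away.
    """
    for person_id, status, start, end in rows:
        day = start
        last = min(end, window_end)
        while day < last:
            yield day, person_id, status
            day += timedelta(days=1)

# Two of the sample rows from the question.
rows = [
    (10233, "active", date(2013, 12, 23), date(2014, 2, 11)),
    (10233, "pending", date(2014, 2, 11), date(2014, 3, 9)),
]
daily = list(expand_to_days(rows, window_end=date(2014, 3, 9)))
```

With 48 million intervals this still produces billions of rows, so it only shows the shape of the result, not a way around the size problem.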

Answer 2

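I need to know how many employees were in the "active" state on each day for the past two years.

To achieve that goal you don't need to create a 7.8-billion-row table. Just use the original table. The algorithm I use can compute averages and sums by day or by month using only a full table scan. Your requirement is quite simple.

Assume from_date is add_months(date'2014-08-05', -24) and to_date is date'2014-08-05'. Try this: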

select t1.*
from t1
where ( (startdate <= date'2014-08-05' and enddate > date'2014-08-05')
      or (startdate <= add_months(date'2014-08-05', -24) and enddate > add_months(date'2014-08-05', -24))
      or (startdate >= add_months(date'2014-08-05', -24) and enddate < date'2014-08-05' ) )

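This gives you every status row that falls within the two years. The statement needs only one full scan; on a 48-million-row table it should finish in a few minutes.

Add a filter on status and a distinct on personid, and you get the result you need.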

select distinct t1.personid,...
from t1
where ( (startdate <= date'2014-08-05' and enddate > date'2014-08-05')
      or (startdate <= add_months(date'2014-08-05', -24) and enddate > add_months(date'2014-08-05', -24))
      or (startdate >= add_months(date'2014-08-05', -24) and enddate < date'2014-08-05' ) )
     and status = 'active'
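To make the three OR-ed date conditions easier to test, here is a small Python sketch that mirrors them one for one. The function name `overlaps_window` and the sample dates are my own, chosen for illustration; the window bounds correspond to add_months(date'2014-08-05', -24) and date'2014-08-05'.

```python
from datetime import date

WINDOW_FROM = date(2012, 8, 5)   # add_months(date'2014-08-05', -24)
WINDOW_TO = date(2014, 8, 5)     # date'2014-08-05'

def overlaps_window(start, end, window_from=WINDOW_FROM, window_to=WINDOW_TO):
    """Mirror of the WHERE clause: True if the row touches the window."""
    return (
        (start <= window_to and end > window_to)           # spans the window end
        or (start <= window_from and end > window_from)    # spans the window start
        or (start >= window_from and end < window_to)      # lies inside the window
    )

inside = overlaps_window(date(2013, 12, 23), date(2014, 2, 11))
open_ended = overlaps_window(date(2014, 1, 7), date(9999, 12, 31))
too_old = overlaps_window(date(2010, 1, 1), date(2011, 1, 1))
ends_on_window_end = overlaps_window(date(2014, 1, 1), WINDOW_TO)
```

Note that a row starting inside the window and ending exactly on the window end is not matched by any of the three conditions. Whether that is what you want depends on how you treat the end date, so it is worth checking against your own data.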

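Update
Per the OP's requirement, "how many employees were in the state of "active" for each day for the past 2 years", my previous solution missed the "each day" part. To find out whether a status was held for the whole two years, the duration of the status has to be computed.

Compute the duration of the status: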

with temp as (select t1.*
from t1
where ( (startdate <= date'2014-08-05' and enddate > date'2014-08-05')
      or (startdate <= add_months(date'2014-08-05', -24) and enddate > add_months(date'2014-08-05', -24))
      or (startdate >= add_months(date'2014-08-05', -24) and enddate < date'2014-08-05' ) )
  and status = 'active')
-- clip every interval to the two-year window, then sum the days per person
select temp.personid,status,
sum(case when enddate < date'2014-08-05' 
      then enddate 
      else date'2014-08-05' 
    end
  - case when startdate > add_months(date'2014-08-05', -24) 
      then startdate 
      else add_months(date'2014-08-05', -24) 
    end) as duration
from temp
group by temp.personid,status
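The duration calculation is just "clip each interval to the window, then add up the days". Here is a rough Python sketch of that arithmetic using two of the sample rows from the question. The names `clipped_days` and `active_days_per_person` are invented for the example, the far-future date stands in for the open-ended sentinel, and the guard against negative lengths is my addition (the SQL relies on the WHERE clause to drop non-overlapping rows first).

```python
from datetime import date

WINDOW_FROM = date(2012, 8, 5)   # add_months(date'2014-08-05', -24)
WINDOW_TO = date(2014, 8, 5)     # date'2014-08-05'

def clipped_days(start, end, window_from=WINDOW_FROM, window_to=WINDOW_TO):
    """Number of days of the interval that fall inside the window."""
    clipped_end = min(end, window_to)         # case when enddate < to_date ...
    clipped_start = max(start, window_from)   # case when startdate > from_date ...
    return max((clipped_end - clipped_start).days, 0)

def active_days_per_person(rows):
    """Sum the clipped durations of 'active' rows, grouped by person."""
    totals = {}
    for person_id, status, start, end in rows:
        if status == "active":
            totals[person_id] = totals.get(person_id, 0) + clipped_days(start, end)
    return totals

rows = [
    (10233, "active", date(2013, 12, 23), date(2014, 2, 11)),
    (10233, "pending", date(2014, 2, 11), date(2014, 3, 9)),
    (10244, "active", date(2014, 1, 7), date(9999, 12, 31)),
]
totals = active_days_per_person(rows)
window_days = (WINDOW_TO - WINDOW_FROM).days
```

Comparing each total with `window_days` is the Python counterpart of filtering on the full two-year duration.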

Then keep only the rows whose duration equals the full 2 years, and the goal is reached.

having 
sum(case when enddate < date'2014-08-05' 
      then enddate 
      else date'2014-08-05' 
    end
  - case when startdate > add_months(date'2014-08-05', -24) 
      then startdate 
      else add_months(date'2014-08-05', -24) 
    end) = date'2014-08-05' - add_months(date'2014-08-05', -24)

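As far as I know, this is the most efficient way. Hope it helps.

Pay close attention to those date comparison conditions. I built a Sql Fiddle to help you test.

Denormalizing and densifying a time-series table in Oracle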
