BigQuery SQL: identify time ranges of a minimum length that satisfy a condition

Time: 2017-04-18 22:33:12

Tags: sql google-bigquery

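A simplified version of my problem: I have a table with the fields id, timestamp, and a numeric variable (speed). I need to identify time periods (start and end timestamps) in which the average speed is below a threshold (say 2) and the period (end timestamp - start timestamp) lasts at least a minimum duration (say 5 hours).

Basically, I need to compute the average over an initial 5-hour window. If that average is below the threshold, keep the start timestamp, move the end_timestamp forward one row, and recompute the average. If the new average is still below the threshold, step forward again, extending the window. If the new average is above the threshold, report the previous end_timestamp as the end_timestamp of this window, begin a new start_timestamp, and compute a fresh average over another 5 hours. The final product is a table with a set of start_timestamps and end_timestamps (and the computed duration) where the average speed is below 2 and the time between start and end is at least 5 hours.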

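I am using Google BigQuery. Here is the general structure I have so far, but it does not seem to do what I want. First, it only tests and reports the speed threshold over the initial 5-hour window, even as the window grows. Second, it does not seem to grow the window correctly: windows rarely exceed 5 hours, although from looking at my data some of them should be twice as long. I am hoping someone has tried to develop a similar analysis and can point out my mistakes.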

SELECT
*,
LEAD(start_timestamp) OVER (PARTITION BY id ORDER BY timestamp) AS
next_start_timestamp,
LEAD(end_timestamp) OVER (PARTITION BY id ORDER BY timestamp) AS
next_end_timestamp
FROM (
SELECT
*,
IF(last_timestamp IS NULL
  OR timestamp - last_timestamp > 1000000*60*60*5, TRUE, FALSE) AS start_timestamp, #1000000*60*60*5 = 5 hours in microseconds
IF(next_timestamp IS NULL
  OR next_timestamp - timestamp > 1000000*60*60*5, TRUE, FALSE) AS end_timestamp #1000000*60*60*5 = 5 hours in microseconds
FROM (
SELECT
  *,
  LAG(timestamp,1) OVER (PARTITION BY id ORDER BY timestamp) last_timestamp,
  LEAD(timestamp,1) OVER (PARTITION BY id ORDER BY timestamp) next_timestamp,
FROM (
  SELECT
    *,
    AVG(speed) OVER (PARTITION BY id ORDER BY timestamp RANGE BETWEEN 5 * 60 * 60 * 1000000 PRECEDING AND CURRENT ROW) AS avg_speed_last_period,
  FROM (
      SELECT
        id,
        timestamp,
        speed
      FROM
        [dataset.table1]))
WHERE
  avg_speed_last_period < 2
ORDER BY
  id,
  timestamp)
HAVING
  start_timestamp
  OR end_timestamp)

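Edit: Here is a link to some sample_data. Given this data and the requirement of an average speed below 2 for at least 5 hours, the first rows of the output table would hopefully be: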

 ID    start_event                end_event                  average_speed    duration_hrs
 203   2015-01-08 17:40:06 UTC    2015-01-09 07:09:35 UTC    0.7802           13.491
 203   2015-01-10 03:43:56 UTC    2015-01-10 08:48:57 UTC    1.452            5.083

1 Answer:

Answer 0 (score: 1)

From your CSV, I assume a schema with the columns id, datetime (a STRING), and speed:

[Screenshot of the schema]

It contains the following data:

[Screenshot of the data]

With that in mind, below is working code for BigQuery Standard SQL that produces the output you expect:

 id    start_event                end_event                  average_speed    duration_hrs
 203   2015-01-08 17:40:00 UTC    2015-01-09 07:09:00 UTC    0.78             13.48
 203   2015-01-10 03:43:00 UTC    2015-01-10 08:48:00 UTC    1.45             5.08
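As a quick sanity check of the duration_hrs column, the same arithmetic the UDF uses (a difference in seconds divided by 60 twice) can be reproduced in plain JavaScript. This is only a sketch: `durationHours` is a hypothetical helper that is not part of the query, and the timestamps are the ones from the table above.

```javascript
// Hypothetical helper (not part of the query): hours between two UTC
// timestamps, rounded to three decimals in the same way as the UDF.
function durationHours(startIso, endIso) {
  var seconds = (Date.parse(endIso) - Date.parse(startIso)) / 1000;
  return Number((seconds / 60 / 60).toFixed(3));
}

// The two ranges from the output table above.
var first = durationHours('2015-01-08T17:40:00Z', '2015-01-09T07:09:00Z');
var second = durationHours('2015-01-10T03:43:00Z', '2015-01-10T08:48:00Z');

console.log(first, second);
```

The first range spans 13 h 29 min and the second 5 h 05 min, which agrees with the 13.48 and 5.08 shown in the table once rounded to two decimals.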
    
#standardSQL
CREATE TEMPORARY FUNCTION IdentifyTimeRanges(
  items ARRAY<STRUCT<ts INT64, speed FLOAT64, datetime TIMESTAMP>>, 
  min_length INT64, threshold FLOAT64, max_speed FLOAT64
)
RETURNS ARRAY<STRUCT<start_event TIMESTAMP, end_event TIMESTAMP, average_speed FLOAT64, duration_hrs FLOAT64>>
LANGUAGE js AS """
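  var result = [];
  var initial = 0;                    // index of the first row of the current window
  var candidate = items[initial].ts;  // start time of the current window (Unix seconds)
  var len = 0;                        // number of rows in the current window
  var sum = 0;                        // sum of speeds in the current window
  for (var i = 0; i < items.length; i++) {
    len++;
    sum += items[i].speed;

    // The window is still shorter than min_length: keep growing it, unless a
    // reading above max_speed forces a restart from the next row.
    if (items[i].ts - candidate < min_length) {
      if (items[i].speed > max_speed) {
        initial = i + 1;
        // Guard: i may be the last row, in which case there is no next row.
        if (initial < items.length) {
          candidate = items[initial].ts;
        }
        len = 0;
        sum = 0;
      }
      continue;
    }

    // Row i breaks the condition: report the window that ended at row i - 1
    // if it qualifies, then start a new window at row i.
    if (sum / len > threshold || items[i].speed > max_speed) {
      var avg_speed = (sum - items[i].speed) / (len - 1);
      if (avg_speed <= threshold && items[i - 1].ts - items[initial].ts >= min_length) {
        var o = {};
        o.start_event = items[initial].datetime;
        // toFixed returns a string; convert back to a number for FLOAT64.
        o.average_speed = Number(avg_speed.toFixed(3));
        o.end_event = items[i - 1].datetime;
        o.duration_hrs = Number(((items[i - 1].ts - items[initial].ts) / 60 / 60).toFixed(3));
        result.push(o);
      }
      initial = i;
      candidate = items[initial].ts;
      len = 1;
      sum = items[initial].speed;
    }
  }

  return result;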
""";

WITH data AS (
  SELECT id, PARSE_TIMESTAMP('%m/%d/%y %H:%M', datetime) AS datetime, speed
  FROM `yourTable`
), compact_data AS (
  SELECT id, ARRAY_AGG(STRUCT<ts INT64, speed FLOAT64, datetime TIMESTAMP>(UNIX_SECONDS(datetime), speed, datetime) ORDER BY UNIX_SECONDS(datetime)) AS points
  FROM data
  GROUP BY id
)
SELECT 
  id, start_event, end_event, average_speed, duration_hrs
FROM compact_data, UNNEST(IdentifyTimeRanges(points, 5*60*60, 2, 3.1)) AS segment
ORDER BY id, start_event
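
To see how the window logic behaves without running a query, the UDF body can be exercised as an ordinary JavaScript function, for example in Node.js. The sketch below mirrors the loop from the UDF; the function name `identifyTimeRanges`, the camelCase field names, and the hourly readings are made up for illustration and are not taken from the sample data.

```javascript
// Local stand-in for the UDF body; ts values are Unix seconds.
function identifyTimeRanges(items, minLength, threshold, maxSpeed) {
  var result = [];
  if (items.length === 0) return result;
  var initial = 0;
  var candidate = items[initial].ts;
  var len = 0;
  var sum = 0;
  for (var i = 0; i < items.length; i++) {
    len++;
    sum += items[i].speed;
    // Still inside the minimum duration: grow, or restart after a spike.
    if (items[i].ts - candidate < minLength) {
      if (items[i].speed > maxSpeed) {
        initial = i + 1;
        if (initial < items.length) candidate = items[initial].ts;
        len = 0;
        sum = 0;
      }
      continue;
    }
    // Row i breaks the condition: close the window at row i - 1.
    if (sum / len > threshold || items[i].speed > maxSpeed) {
      var avgSpeed = (sum - items[i].speed) / (len - 1);
      var span = items[i - 1].ts - items[initial].ts;
      if (avgSpeed <= threshold && span >= minLength) {
        result.push({
          start: items[initial].ts,
          end: items[i - 1].ts,
          averageSpeed: Number(avgSpeed.toFixed(3)),
          durationHrs: Number((span / 3600).toFixed(3))
        });
      }
      initial = i;
      candidate = items[initial].ts;
      len = 1;
      sum = items[initial].speed;
    }
  }
  return result;
}

// Made-up hourly readings: eight slow hours, one spike, one slow hour.
var speeds = [1, 1, 1, 1, 1, 1, 1, 1, 10, 1];
var items = speeds.map(function (speed, hour) {
  return { ts: hour * 3600, speed: speed };
});

var ranges = identifyTimeRanges(items, 5 * 60 * 60, 2, 3.1);
console.log(ranges);
```

With these readings the function should report a single range from ts 0 to ts 25200 (7 hours) with an average speed of 1; the spike at hour 8 closes it. Note that a range is only reported when a later row closes it, so a qualifying window that is still open at the last row is not returned by this logic.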

Please note: this code uses User-Defined Functions, which means you may hit some limits, quotas, and costs depending on the size of your data.

Also keep in mind: if the data type of your datetime field is not STRING, the only thing you need to do is slightly adjust the data subquery; the rest should stay the same!

For example, if datetime is of TIMESTAMP data type, you just need to replace

  SELECT id, PARSE_TIMESTAMP('%m/%d/%y %H:%M', datetime) AS datetime, speed
  FROM `yourTable`

with

  SELECT id, datetime, speed
  FROM `yourTable`

Hope you like it :o)