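A simplified version of my problem: I have a table with the following fields: id, timestamp, and a numeric variable (speed). I need to identify time periods (start and end timestamps) during which the average speed is below a threshold (say 2), and where the period (end timestamp minus start timestamp) lasts at least a minimum duration (say 5 hours). Basically, I need to compute the average over an initial 5-hour window; if that average is below the threshold, keep the start timestamp, move the end_timestamp forward one row, and recompute the average. If the new average is still below the threshold, step forward again, extending the window. If the new average is above the threshold, report the previous end_timestamp as this window's end_timestamp, start a new start_timestamp, and compute a new average over another 5 hours. The final product is a table with a set of start_timestamps and end_timestamps (and the computed duration) for which the average speed is below 2 and the time between start and end is at least 5 hours.
I am using Google BigQuery. Here is the general structure I have so far, but it does not do what I want. First, it only tests and reports the speed threshold for the initial 5-hour window, even as the window grows. Second, it does not seem to grow the window correctly: windows rarely exceed 5 hours, even though, looking at my data, some of them should be twice as long. I hope someone has tried to develop a similar analysis and can point out my mistake.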
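The procedure described above can be sketched in plain JavaScript, which makes the intended window growth easier to check on a handful of rows before translating it to SQL. This is only a minimal sketch under assumptions: the function name `findSlowPeriods`, the row shape `{ts, speed}` (with `ts` in seconds), and the choice to retry from the next row when an initial window is not slow enough are all hypothetical and not part of the original description.

```javascript
// Sketch of the growing-window procedure. All names here are hypothetical.
// rows: [{ts: seconds, speed: number}], sorted by ts, for a single id.
function findSlowPeriods(rows, threshold, minSeconds) {
  const periods = [];
  let start = 0;
  while (start < rows.length) {
    // Build the initial window: extend until it spans at least minSeconds.
    let end = start;
    let sum = rows[start].speed;
    while (end + 1 < rows.length && rows[end].ts - rows[start].ts < minSeconds) {
      end++;
      sum += rows[end].speed;
    }
    if (rows[end].ts - rows[start].ts < minSeconds) break; // not enough data left

    // Assumption: if the initial window is not slow enough, retry from the next row.
    if (sum / (end - start + 1) >= threshold) {
      start++;
      continue;
    }

    // Grow the window one row at a time while the average stays below the threshold.
    while (end + 1 < rows.length &&
           (sum + rows[end + 1].speed) / (end - start + 2) < threshold) {
      end++;
      sum += rows[end].speed;
    }
    periods.push({
      start_ts: rows[start].ts,
      end_ts: rows[end].ts,
      average_speed: sum / (end - start + 1),
      duration_hrs: (rows[end].ts - rows[start].ts) / 3600,
    });
    start = end + 1; // the next window starts after the reported one
  }
  return periods;
}

// Hypothetical hourly readings: seven slow hours, one spike, two slow hours.
const rows = [1, 1, 1, 1, 1, 1, 1, 10, 1, 1].map((speed, k) => ({ ts: k * 3600, speed }));
const periods = findSlowPeriods(rows, 2, 5 * 60 * 60);
console.log(periods);
```

With these made-up readings the sketch reports one period, from second 0 to second 21600 (6 hours) with an average speed of 1; the two slow rows after the spike span less than 5 hours and are dropped.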
SELECT
  *,
  LEAD(start_timestamp) OVER (PARTITION BY id ORDER BY timestamp) AS next_start_timestamp,
  LEAD(end_timestamp) OVER (PARTITION BY id ORDER BY timestamp) AS next_end_timestamp
FROM (
  SELECT
    *,
    IF(last_timestamp IS NULL
      OR timestamp - last_timestamp > 1000000*60*60*5, TRUE, FALSE) AS start_timestamp, #1000000*60*60*5 = 5 hours in microseconds
    IF(next_timestamp IS NULL
      OR next_timestamp - timestamp > 1000000*60*60*5, TRUE, FALSE) AS end_timestamp #1000000*60*60*5 = 5 hours in microseconds
  FROM (
    SELECT
      *,
      LAG(timestamp,1) OVER (PARTITION BY id ORDER BY timestamp) last_timestamp,
      LEAD(timestamp,1) OVER (PARTITION BY id ORDER BY timestamp) next_timestamp,
    FROM (
      SELECT
        *,
        AVG(speed) OVER (PARTITION BY id ORDER BY timestamp RANGE BETWEEN 5 * 60 * 60 * 1000000 PRECEDING AND CURRENT ROW) AS avg_speed_last_period,
      FROM (
        SELECT
          id,
          timestamp,
          speed
        FROM
          [dataset.table1]))
    WHERE
      avg_speed_last_period < 2
    ORDER BY
      id,
      timestamp)
  HAVING
    start_timestamp
    OR end_timestamp)
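Edit: Here is a link to some sample_data. Given this data and the requirement that the average speed be below 2 for at least 5 hours, the first rows of the output table should be: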
ID   start_event              end_event                average_speed  duration_hrs
203  2015-01-08 17:40:06 UTC  2015-01-09 07:09:35 UTC  0.7802         13.491
203  2015-01-10 03:43:56 UTC  2015-01-10 08:48:57 UTC  1.452          5.083
Answer 0 (score: 1)
From your CSV, I assume the schema below
with the following data:
With that in mind, the BigQuery Standard SQL code below works and returns the output you expect:
id   start_event              end_event                average_speed  duration_hrs
203  2015-01-08 17:40:00 UTC  2015-01-09 07:09:00 UTC  0.78           13.48
203  2015-01-10 03:43:00 UTC  2015-01-10 08:48:00 UTC  1.45           5.08
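#standardSQL
CREATE TEMPORARY FUNCTION IdentifyTimeRanges(
  items ARRAY<STRUCT<ts INT64, speed FLOAT64, datetime TIMESTAMP>>,
  min_length INT64, threshold FLOAT64, max_speed FLOAT64
)
RETURNS ARRAY<STRUCT<start_event TIMESTAMP, end_event TIMESTAMP, average_speed FLOAT64, duration_hrs FLOAT64>>
LANGUAGE js AS """
  var result = [];
  if (items.length === 0) return result;
  var initial = 0;                    // index of the first row of the current window
  var candidate = items[initial].ts;  // start time (seconds) of the current window
  var len = 0;                        // number of rows in the current window
  var sum = 0;                        // sum of speeds in the current window
  for (var i = 0; i < items.length; i++) {
    len++;
    sum += items[i].speed;
    // Window still shorter than min_length: keep growing it, but restart
    // after any single reading above max_speed.
    if (items[i].ts - candidate < min_length) {
      if (items[i].speed > max_speed) {
        initial = i + 1;
        // Guard: i may be the last row, so there may be no next row to start from.
        if (initial < items.length) candidate = items[initial].ts;
        len = 0;
        sum = 0;
      }
      continue;
    }
    // Window is long enough: close it once the running average (or a single
    // reading) exceeds its limit, and report the window up to the previous row.
    if (sum / len > threshold || items[i].speed > max_speed) {
      var avg_speed = (sum - items[i].speed) / (len - 1);
      if (avg_speed <= threshold && items[i - 1].ts - items[initial].ts >= min_length) {
        var o = {};
        o.start_event = items[initial].datetime;
        o.average_speed = Number(avg_speed.toFixed(3));
        o.end_event = items[i - 1].datetime;
        o.duration_hrs = Number(((items[i - 1].ts - items[initial].ts) / 60 / 60).toFixed(3));
        result.push(o);
      }
      // Start the next window at the current row.
      initial = i;
      candidate = items[initial].ts;
      len = 1;
      sum = items[initial].speed;
    }
  }
  return result;
""";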
WITH data AS (
  SELECT id, PARSE_TIMESTAMP('%m/%d/%y %H:%M', datetime) AS datetime, speed
  FROM `yourTable`
), compact_data AS (
  SELECT
    id,
    ARRAY_AGG(
      STRUCT<ts INT64, speed FLOAT64, datetime TIMESTAMP>(UNIX_SECONDS(datetime), speed, datetime)
      ORDER BY UNIX_SECONDS(datetime)
    ) AS points
  FROM data
  GROUP BY id
)
SELECT
  id, start_event, end_event, average_speed, duration_hrs
-- Arguments: min_length = 5 hours in seconds, threshold = 2, max_speed = 3.1
FROM compact_data, UNNEST(IdentifyTimeRanges(points, 5*60*60, 2, 3.1)) AS segment
ORDER BY id, start_event
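Because the heavy lifting happens inside the JavaScript UDF, it can help to exercise that logic locally with Node.js before running it against a real table. The following is a minimal sketch, not part of the query itself: the function name `identifyTimeRanges`, the camelCase parameter names, and the hourly sample readings are hypothetical, and the body mirrors the UDF with variables declared and a bounds guard for the case where a reset happens on the last row.

```javascript
// Local copy of the UDF logic so it can be run with Node.js outside BigQuery.
// The function name and the sample readings below are hypothetical.
function identifyTimeRanges(items, minLength, threshold, maxSpeed) {
  const result = [];
  if (items.length === 0) return result;
  let initial = 0;
  let candidate = items[initial].ts;
  let len = 0;
  let sum = 0;
  for (let i = 0; i < items.length; i++) {
    len++;
    sum += items[i].speed;
    // Window shorter than minLength: keep growing, restart after a spike.
    if (items[i].ts - candidate < minLength) {
      if (items[i].speed > maxSpeed) {
        initial = i + 1;
        if (initial < items.length) candidate = items[initial].ts;
        len = 0;
        sum = 0;
      }
      continue;
    }
    // Window long enough: close it when the average or a reading exceeds its limit.
    if (sum / len > threshold || items[i].speed > maxSpeed) {
      const avgSpeed = (sum - items[i].speed) / (len - 1);
      if (avgSpeed <= threshold && items[i - 1].ts - items[initial].ts >= minLength) {
        result.push({
          start_event: items[initial].datetime,
          end_event: items[i - 1].datetime,
          average_speed: Number(avgSpeed.toFixed(3)),
          duration_hrs: Number(((items[i - 1].ts - items[initial].ts) / 3600).toFixed(3)),
        });
      }
      initial = i;
      candidate = items[initial].ts;
      len = 1;
      sum = items[initial].speed;
    }
  }
  return result;
}

// Hypothetical hourly readings: seven slow hours, one fast spike, two slow hours.
const speeds = [1, 1, 1, 1, 1, 1, 1, 10, 1, 1];
const items = speeds.map((speed, k) => ({
  ts: k * 3600,                                       // seconds, like UNIX_SECONDS()
  speed: speed,
  datetime: new Date(k * 3600 * 1000).toISOString(),  // stand-in for TIMESTAMP
}));
const segments = identifyTimeRanges(items, 5 * 60 * 60, 2, 3.1);
console.log(segments);
```

For this made-up input the function returns a single segment covering the first seven readings (6 hours, average speed 1). Note that a segment is only emitted when a later row pushes the average or a reading over its limit, so a slow stretch that runs to the very end of the data is not reported by this logic.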
Please note: this code uses User-Defined Functions, which means that some limits, quotas, and a cost hit may apply, depending on the size of your data.
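Also keep in mind: if the data type of your datetime field is not STRING, the only thing you need to do is slightly adjust the data subquery; the rest should stay the same!
For example, if datetime is of TIMESTAMP data type, you just need to replace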
SELECT id, PARSE_TIMESTAMP('%m/%d/%y %H:%M', datetime) AS datetime, speed
FROM `yourTable`
with
SELECT id, datetime, speed
FROM `yourTable`
Hope you like it :o)