In BigQuery

Time: 2019-04-26 09:23:51

Tags: google-bigquery

The energy usage of each device is recorded once per hour:

+--------------+-----------+-----------------------+
| energy_usage | device_id |  timestamp            |
+--------------+-----------+-----------------------+
| 10           | 1         |  2019-02-12T01:00:00  |
| 16           | 2         |  2019-02-12T01:00:00  |
| 26           | 1         |  2019-03-12T02:00:00  |
| 24           | 2         |  2019-03-12T02:00:00  |
+--------------+-----------+-----------------------+

My goals are:

  1. Create two columns: energy_usage_day (8 am to 8 pm) and energy_usage_night (8 pm to 8 am).
  2. Create monthly totals, grouped by device_id, that sum the energy usage.

So the result might look like this:

+--------------+------------------+--------------------+-----------+---------+------+
| energy_usage | energy_usage_day | energy_usage_night | device_id |  month  | year |
+--------------+------------------+--------------------+-----------+---------+------+
| 80           | 30               | 50                 | 1         | 2       | 2019 |
| 130          | 60               | 70                 | 2         | 3       | 2019 |
+--------------+------------------+--------------------+-----------+---------+------+

The following query produces that result:

SELECT SUM(energy_usage) energy_usage
  , SUM(IF(EXTRACT(HOUR FROM timestamp) BETWEEN 8 AND 19, energy_usage, 0)) energy_usage_day
  , SUM(IF(EXTRACT(HOUR FROM timestamp) NOT BETWEEN 8 AND 19, energy_usage, 0)) energy_usage_night
  , device_id
  , EXTRACT(MONTH FROM timestamp) month, EXTRACT(YEAR FROM timestamp) year
FROM `data`
GROUP BY device_id, month, year
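
To make the day/night split concrete, here is a small Python model of what the query above computes. It is only a sketch for checking the logic offline, not BigQuery code; `is_day_hour` and `monthly_totals` are illustrative names, and the sample rows are the four rows from the table at the top of the question.

```python
from collections import defaultdict
from datetime import datetime


def is_day_hour(ts: datetime) -> bool:
    # Hours 8..19 cover 08:00 up to (but not including) 20:00,
    # mirroring BETWEEN 8 AND 19 in the query.
    return 8 <= ts.hour <= 19


def monthly_totals(rows):
    """rows: iterable of (energy_usage, device_id, timestamp) tuples."""
    totals = defaultdict(
        lambda: {"energy_usage": 0, "energy_usage_day": 0, "energy_usage_night": 0}
    )
    for usage, device_id, ts in rows:
        # Same grouping key as GROUP BY device_id, month, year.
        bucket = totals[(device_id, ts.month, ts.year)]
        bucket["energy_usage"] += usage
        column = "energy_usage_day" if is_day_hour(ts) else "energy_usage_night"
        bucket[column] += usage
    return dict(totals)


sample = [
    (10, 1, datetime(2019, 2, 12, 1)),
    (16, 2, datetime(2019, 2, 12, 1)),
    (26, 1, datetime(2019, 3, 12, 2)),
    (24, 2, datetime(2019, 3, 12, 2)),
]
result = monthly_totals(sample)
```

All four sample readings fall at 01:00 or 02:00, so in this model they land entirely in energy_usage_night.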

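Say I am only interested in the total energy usage above a specific threshold, e.g. 50. I want the SUM to start only once the total energy usage has reached 50. The result should look like this: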

+--------------+------------------+--------------------+-----------+---------+------+
| energy_usage | energy_usage_day | energy_usage_night | device_id |  month  | year |
+--------------+------------------+--------------------+-----------+---------+------+
| 30           | 10               | 20                 | 1         | 2       | 2019 |
| 80           | 50               | 30                 | 2         | 3       | 2019 |
+--------------+------------------+--------------------+-----------+---------+------+

In other words: the query should start summing energy_usage, energy_usage_day and energy_usage_night only once energy_usage has reached the threshold of 50.

Is this possible in BigQuery?

1 answer:

Answer 0 (score: 1)

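Below is a version for BigQuery Standard SQL. The logic is to start aggregating usage only after the running total has passed 50 (per device, per month):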

    
#standardSQL
WITH temp AS (
  SELECT *, SUM(energy_usage) OVER(win) > 50 qualified,
    EXTRACT(HOUR FROM `timestamp`) BETWEEN 8 AND 19 day_hour,
    EXTRACT(MONTH FROM `timestamp`) month, 
    EXTRACT(YEAR FROM `timestamp`) year    
  FROM `project.dataset.table`
  WINDOW win AS (PARTITION BY device_id, TIMESTAMP_TRUNC(`timestamp`, MONTH) ORDER BY `timestamp`)
)
SELECT SUM(energy_usage) energy_usage,
  SUM(IF(day_hour, energy_usage, 0)) energy_usage_day,
  SUM(IF(NOT day_hour, energy_usage, 0)) energy_usage_night,
  device_id,
  month, 
  year
FROM temp
WHERE qualified
GROUP BY device_id, month, year   
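
For readers who want to check the `qualified` logic without running BigQuery, here is a minimal Python sketch of the same idea: a running total per device and calendar month, keeping only the rows at which that total exceeds the threshold. The function name `qualified_rows` and the sample readings are made up for illustration, and the sketch assumes at most one reading per device per timestamp.

```python
from datetime import datetime
from itertools import groupby

THRESHOLD = 50


def partition_key(row):
    # Mirrors PARTITION BY device_id, TIMESTAMP_TRUNC(timestamp, MONTH).
    usage, device_id, ts = row
    return (device_id, ts.year, ts.month)


def qualified_rows(rows, threshold=THRESHOLD):
    """Return rows whose running total within their partition exceeds threshold."""
    ordered = sorted(rows, key=lambda r: (partition_key(r), r[2]))
    kept = []
    for _, group in groupby(ordered, key=partition_key):
        running = 0
        for usage, device_id, ts in group:
            running += usage  # SUM(energy_usage) OVER(win)
            if running > threshold:  # the `qualified` flag
                kept.append((usage, device_id, ts))
    return kept


# Hypothetical readings for device 1: running totals are 30, 49, 51, 56.
readings = [
    (30, 1, datetime(2019, 2, 12, 1)),
    (19, 1, datetime(2019, 2, 12, 2)),
    (2, 1, datetime(2019, 2, 12, 3)),
    (5, 1, datetime(2019, 2, 12, 4)),
]
kept = qualified_rows(readings)
total_after_threshold = sum(usage for usage, _, _ in kept)
```

Note that the row that crosses the threshold (the reading of 2) is counted in full here, so the summed usage is 7 rather than 56 - 50 = 6.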
  

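Suppose the running SUM of usage is 49 and the next usage item has a value of 2. The SUM becomes 51, so the full usage of 2 is added to the result. Instead, only the part above the threshold, i.e. 1, should be added. Can this kind of problem be solved in BigQuery SQL?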

#standardSQL
WITH temp AS (
  SELECT *, SUM(energy_usage) OVER(win) > 50 qualified,
    SUM(energy_usage) OVER(win) - 50 rolling_sum,
    EXTRACT(HOUR FROM `timestamp`) BETWEEN 8 AND 19 day_hour,
    EXTRACT(MONTH FROM `timestamp`) month, 
    EXTRACT(YEAR FROM `timestamp`) year    
  FROM `project.dataset.table`
  WINDOW win AS (PARTITION BY device_id, TIMESTAMP_TRUNC(`timestamp`, MONTH) ORDER BY `timestamp`)
), temp_with_adjustments AS (
  SELECT *, 
    IF(
      ROW_NUMBER() OVER(PARTITION BY device_id, month, year ORDER BY `timestamp`) = 1, 
      rolling_sum, 
      energy_usage
    ) AS adjusted_energy_usage
  FROM temp 
  WHERE qualified
)
SELECT SUM(adjusted_energy_usage) energy_usage,
  SUM(IF(day_hour, adjusted_energy_usage, 0)) energy_usage_day,
  SUM(IF(NOT day_hour, adjusted_energy_usage, 0)) energy_usage_night,
  device_id,
  month, 
  year
FROM temp_with_adjustments
GROUP BY device_id, month, year  

As you can see, I just added the temp_with_adjustments step (and added rolling_sum to temp to support it); the rest is the same.
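
The adjustment can be sanity-checked with a short Python sketch for a single device and month. It assumes the readings are already in timestamp order and non-negative; `adjusted_usages` is an illustrative name, not something from BigQuery. The first row past the threshold contributes only the excess (the role of `rolling_sum`), and every later row contributes its full value.

```python
def adjusted_usages(usages, threshold=50):
    """usages: one device's readings for one month, in timestamp order."""
    running = 0
    first_qualified = True
    adjusted = []
    for usage in usages:
        running += usage
        if running <= threshold:
            continue  # not yet qualified, row is dropped
        if first_qualified:
            # Only the part above the threshold counts for the crossing row.
            adjusted.append(running - threshold)
            first_qualified = False
        else:
            adjusted.append(usage)
    return adjusted


# Running totals 30, 49, 51, 56: the reading of 2 contributes only 1.
example = adjusted_usages([30, 19, 2, 5])
```

With these inputs the adjusted values sum to 6, which equals the month's total of 56 minus the threshold of 50.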