Question

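We have a data lake in AWS S3 holding sensor records from tens of thousands of sensors, each of which sends a record every 5 minutes. I am trying to work out the battery life of the sensors using AWS Athena (Presto).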

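Each sensor increments its record index by 1 for every sample record. When the battery is removed or replaced, the record index starts again from 1.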

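All data from all sensors is in the same table, and the row order is based on the time the data was received at AWS (it cannot be assumed to be sorted). There are duplicates.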

Example dataset:

+--------+-----------------+------------+---------+
| sensor | samplerecorded  | record     | battery |
+--------+-----------------+------------+---------+
|   2930 | 4/26/2019 22:04 | 1          |     3.3 |
|   2930 | 4/25/2019 22:04 | 5          |       2 |
|   2930 | 4/24/2019 22:04 | 6          |     1.8 |
|   2930 | 4/23/2019 22:04 | 4          |     2.5 |
|   2930 | 4/23/2019 22:04 | 4          |     2.5 |
|   2931 | 4/20/2019 22:04 | 1          |     3.4 |
|   2931 | 4/19/2019 22:04 | 3          |       2 |
|   2930 | 4/22/2019 22:04 | 3          |       3 |
|   2931 | 4/18/2019 22:04 | 2          |       3 |
|   2931 | 4/18/2019 22:04 | 2          |       3 |
|   2931 | 4/17/2019 22:04 | 1          |     3.3 |
|   2931 | 4/17/2019 22:04 | 1          |     3.3 |
|   2931 | 4/17/2019 22:04 | 3          |       2 |
|   2931 | 4/16/2019 22:04 | 2          |     2.5 |
|   2931 | 4/15/2019 22:04 | 1          |       3 |
|   2931 | 4/13/2019 22:04 | 5          |     1.9 |
|   2931 | 4/12/2019 22:04 | 4          |       2 |
|   2931 | 4/11/2019 22:04 | 3          |     2.1 |
|   2930 | 4/21/2019 22:04 | 2          |     2.8 |
|   2930 | 4/20/2019 22:04 | 1          |       3 |
|   2930 | 4/19/2019 22:04 | 8          |       2 |
+--------+-----------------+------------+---------+

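For each sensor, I want to group the rows into complete datasets (records 1 to N, where N is the last row before the next (sorted) sample record restarts at "1"), and get the battery voltage and date at the start and end of each dataset, plus a calculated column with the battery life.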

Desired output:

+--------+------------+-----------+-----------+---------+---------------+
| sensor | date_start | date_end  | bat_start | bat_end | bat_life_days |
+--------+------------+-----------+-----------+---------+---------------+
|   2930 | 4/20/2019  | 4/24/2019 |         3 |     1.8 |             4 |
|   2931 | 4/15/2019  | 4/17/2019 |         3 |       2 |             2 |
|   2931 | 4/17/2019  | 4/24/2019 |       3.3 |     1.8 |             7 |
+--------+------------+-----------+-----------+---------+---------------+
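
One way to express this grouping is the common "gaps and islands" pattern: flag every row whose record index does not increase relative to the previous sample of the same sensor, then use a running sum of those flags as a dataset id. The following is a minimal sketch, not a tested solution for the full table. It assumes Presto/Athena syntax, uses only the sensor 2930 rows from the example dataset as inline VALUES, and introduces the names is_restart and dataset_id purely for illustration. Keeping only datasets that start at record 1 and contain more than one row is also an assumption.

```sql
-- Rows for sensor 2930 copied from the example dataset (including the duplicate).
WITH samples (sensor, samplerecorded, record, battery) AS (
  VALUES
    (2930, TIMESTAMP '2019-04-26 22:04:00', 1, 3.3),
    (2930, TIMESTAMP '2019-04-25 22:04:00', 5, 2.0),
    (2930, TIMESTAMP '2019-04-24 22:04:00', 6, 1.8),
    (2930, TIMESTAMP '2019-04-23 22:04:00', 4, 2.5),
    (2930, TIMESTAMP '2019-04-23 22:04:00', 4, 2.5),
    (2930, TIMESTAMP '2019-04-22 22:04:00', 3, 3.0),
    (2930, TIMESTAMP '2019-04-21 22:04:00', 2, 2.8),
    (2930, TIMESTAMP '2019-04-20 22:04:00', 1, 3.0),
    (2930, TIMESTAMP '2019-04-19 22:04:00', 8, 2.0)
),
-- Drop exact duplicate rows.
deduped AS (
  SELECT DISTINCT sensor, samplerecorded, record, battery
  FROM samples
),
-- is_restart (hypothetical name) is 1 when the record index fails to increase.
-- The first row of a sensor has no previous row and gets 0.
flagged AS (
  SELECT
    d.*,
    CASE
      WHEN record <= LAG(record) OVER (PARTITION BY sensor ORDER BY samplerecorded)
      THEN 1
      ELSE 0
    END AS is_restart
  FROM deduped d
),
-- The running sum of restart flags numbers the datasets per sensor.
numbered AS (
  SELECT
    f.*,
    SUM(is_restart) OVER (
      PARTITION BY sensor
      ORDER BY samplerecorded
      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS dataset_id
  FROM flagged f
)
SELECT
  sensor,
  CAST(MIN(samplerecorded) AS DATE) AS date_start,
  CAST(MAX(samplerecorded) AS DATE) AS date_end,
  MIN_BY(battery, samplerecorded) AS bat_start,
  MAX_BY(battery, samplerecorded) AS bat_end,
  DATE_DIFF('day', MIN(samplerecorded), MAX(samplerecorded)) AS bat_life_days
FROM numbered
GROUP BY sensor, dataset_id
-- Assumption: keep only datasets that begin at record 1 and have more than one row.
HAVING MIN_BY(record, samplerecorded) = 1 AND COUNT(*) > 1
ORDER BY sensor, date_start;
```

On these rows the query should return a single row for sensor 2930, running from 2019-04-20 to 2019-04-24 with bat_start 3.0, bat_end 1.8 and bat_life_days 4, which matches the first row of the desired output. Samples that share a timestamp (as sensor 2931 has on 4/17) make the ordering ambiguous and would need an additional tie-breaker.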

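My current implementation mostly works, but noisy signals in the data produce bad results here and there. Right now I take the data from the start and end of each dataset. I still need to "merge" rows per serial number between a new battery (> 3000 mV) and an empty battery (< 2000 mV), because customers regularly pull the battery when the device is not working properly, which distorts the results.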
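
The merging could perhaps be done by treating a record restart as a real battery change only when the voltage also jumps from below 3000 mV to above it, so that a pulled and reinserted battery stays in the same cycle. The sketch below is a heuristic under stated assumptions, not a verified solution: it reuses the column names from my query (serialnumber, samplerecorded, record, battvoltage), the readings in VALUES are made up for illustration, and the names is_new_battery and cycle_id are hypothetical. The 3000 mV and 2000 mV limits are the ones mentioned above and would likely need tuning.

```sql
-- Made-up readings in millivolts: the battery is pulled and reinserted on 05-03,
-- then replaced with a fresh one on 05-06.
WITH samples (serialnumber, samplerecorded, record, battvoltage) AS (
  VALUES
    ('2930', TIMESTAMP '2019-05-01 22:04:00', 1, 3300),
    ('2930', TIMESTAMP '2019-05-02 22:04:00', 2, 2900),
    ('2930', TIMESTAMP '2019-05-03 22:04:00', 1, 2850),
    ('2930', TIMESTAMP '2019-05-04 22:04:00', 2, 2400),
    ('2930', TIMESTAMP '2019-05-05 22:04:00', 3, 1900),
    ('2930', TIMESTAMP '2019-05-06 22:04:00', 1, 3250),
    ('2930', TIMESTAMP '2019-05-07 22:04:00', 2, 3100)
),
-- A restart counts as a new battery only if the voltage crosses 3000 mV upwards.
flagged AS (
  SELECT
    s.*,
    CASE
      WHEN record <= LAG(record) OVER (PARTITION BY serialnumber ORDER BY samplerecorded)
       AND battvoltage > 3000
       AND LAG(battvoltage) OVER (PARTITION BY serialnumber ORDER BY samplerecorded) < 3000
      THEN 1
      ELSE 0
    END AS is_new_battery
  FROM samples s
),
-- The running sum of new-battery flags numbers the battery cycles per device.
cycles AS (
  SELECT
    f.*,
    SUM(is_new_battery) OVER (
      PARTITION BY serialnumber
      ORDER BY samplerecorded
      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS cycle_id
  FROM flagged f
)
SELECT
  serialnumber,
  MIN(samplerecorded) AS t_start,
  MAX(samplerecorded) AS t_end,
  MIN_BY(battvoltage, samplerecorded) AS batt_start,
  MAX_BY(battvoltage, samplerecorded) AS batt_end,
  DATE_DIFF('day', MIN(samplerecorded), MAX(samplerecorded)) AS batt_time_days
FROM cycles
GROUP BY serialnumber, cycle_id
-- Keep only cycles that run from a new battery down to an empty one.
HAVING MIN_BY(battvoltage, samplerecorded) > 3000
   AND MAX_BY(battvoltage, samplerecorded) < 2000
ORDER BY serialnumber, t_start;
```

With these made-up rows, the restart on 05-03 is merged into the first cycle, so the query should return one cycle from 2019-05-01 to 2019-05-05 (3300 mV down to 1900 mV, 4 days). The cycle that starts on 05-06 is dropped because its battery is not yet empty.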

Any suggestions on how to achieve this? And is there a better way to do it than the query below?

WITH partitioned_data AS (
  SELECT 
    cs.*, 
    ROW_NUMBER() OVER (
      PARTITION BY serialnumber 
      ORDER BY 
        samplerecorded ASC
    ) AS rn 
  FROM 
    crawler_samples cs 
  WHERE 
    serialnumber LIKE '293%' 
    AND -- noise in data
    battVoltage IS NOT NULL 
    AND battVoltage <> 0 
    AND battvoltage > 1000 -- millivolts in source
    ), 
first_sample_table AS (
  SELECT 
    rn AS start_rn, 
    serialnumber AS start_sn, 
    samplerecorded AS t_start, 
    battvoltage AS batt_start, 
    count(*) OVER (PARTITION BY serialnumber) AS reboots 
  FROM 
    partitioned_data 
  WHERE 
    record = 0
), 
-- list of excluded devices
excluded_devices AS (
  SELECT 
    DISTINCT start_sn AS sn 
  FROM 
    first_sample_table 
  WHERE 
    -- devices that sent illegal data at start of dataset
    batt_start = 65535
), 
-- remove excluded devices
filtered_fst AS (
  SELECT 
    * 
  FROM 
    first_sample_table 
    LEFT JOIN excluded_devices ON first_sample_table.start_sn = excluded_devices.sn 
  WHERE 
    excluded_devices.sn IS NULL
), 
--add the previous battery empty to same row
last_row_table AS (
  SELECT 
    filtered_fst.*, 
    partitioned_data.rn, 
    LAG(partitioned_data.battVoltage, 2) OVER (
      PARTITION BY partitioned_data.serialnumber 
      ORDER BY 
        partitioned_data.samplerecorded
    ) AS batt_end, 
    LAG(
      partitioned_data.samplerecorded, 
      2
    ) OVER (
      PARTITION BY partitioned_data.serialnumber 
      ORDER BY 
        partitioned_data.samplerecorded
    ) AS t_end 
  FROM 
    partitioned_data 
    LEFT JOIN filtered_fst ON partitioned_data.serialnumber = filtered_fst.start_sn 
    AND partitioned_data.rn = filtered_fst.start_rn
), 
-- clean out join in next cte or LAG would be incorrect (WHERE executes first)
cleaned_last_row_table AS (
  SELECT 
    * 
  FROM 
    last_row_table 
  WHERE 
    start_sn IS NOT NULL
), 
--offset end columns to match start and stop
offset_end AS (
  SELECT 
    start_sn AS serialnumber, 
    start_rn AS rownum, 
    t_start, 
    batt_start, 
    lead(t_end, 1) OVER (
      PARTITION BY start_sn 
      ORDER BY 
        t_start
    ) AS t_end, 
    lead(batt_end, 1) OVER (
      PARTITION BY start_sn 
      ORDER BY 
        t_start
    ) AS batt_end, 
    reboots 
  FROM 
    cleaned_last_row_table
), 
-- append calculated column (easier to read)
FINAL AS (
  SELECT 
    *, 
    date_diff('day', t_start, t_end) AS batt_time_days 
  FROM 
    offset_end 
  WHERE 
    t_start IS NOT NULL 
    AND t_end IS NOT NULL
) 
SELECT 
  * 
FROM 
  FINAL 
ORDER BY 
  serialnumber, 
  rownum;

Extracting values from non-monotonic changes in a time series

0 answers: