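We have a data lake in AWS S3 with sensor records from tens of thousands of sensors, each sending a sample every 5 minutes. I am trying to determine the battery life of the sensors using AWS Athena (Presto).
Each sensor increments its record index by 1 for every sample record. When the battery is pulled or replaced, the record index starts over from 1.
All data from all sensors is in the same table, and the row order is based on the time the data was received at AWS (it cannot be assumed to be sorted). There are duplicates.
Example dataset: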
+--------+-----------------+------------+---------+
| sensor | samplerecorded | record | battery |
+--------+-----------------+------------+---------+
| 2930 | 4/26/2019 22:04 | 1 | 3.3 |
| 2930 | 4/25/2019 22:04 | 5 | 2 |
| 2930 | 4/24/2019 22:04 | 6 | 1.8 |
| 2930 | 4/23/2019 22:04 | 4 | 2.5 |
| 2930 | 4/23/2019 22:04 | 4 | 2.5 |
| 2931 | 4/20/2019 22:04 | 1 | 3.4 |
| 2931 | 4/19/2019 22:04 | 3 | 2 |
| 2930 | 4/22/2019 22:04 | 3 | 3 |
| 2931 | 4/18/2019 22:04 | 2 | 3 |
| 2931 | 4/18/2019 22:04 | 2 | 3 |
| 2931 | 4/17/2019 22:04 | 1 | 3.3 |
| 2931 | 4/17/2019 22:04 | 1 | 3.3 |
| 2931 | 4/17/2019 22:04 | 3 | 2 |
| 2931 | 4/16/2019 22:04 | 2 | 2.5 |
| 2931 | 4/15/2019 22:04 | 1 | 3 |
| 2931 | 4/13/2019 22:04 | 5 | 1.9 |
| 2931 | 4/12/2019 22:04 | 4 | 2 |
| 2931 | 4/11/2019 22:04 | 3 | 2.1 |
| 2930 | 4/21/2019 22:04 | 2 | 2.8 |
| 2930 | 4/20/2019 22:04 | 1 | 3 |
| 2930 | 4/19/2019 22:04 | 8 | 2 |
+--------+-----------------+------------+---------+
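For each sensor, I would like to group the complete datasets (records 1 to N, where N is the last row before the next (sorted) sample record starts over from "1") and get the battery voltage and date at the start and end of each dataset, plus a calculated column with the battery life.
Desired output: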
+--------+------------+-----------+-----------+---------+---------------+
| sensor | date_start | date_end | bat_start | bat_end | bat_life_days |
+--------+------------+-----------+-----------+---------+---------------+
| 2930 | 4/20/2019 | 4/24/2019 | 3 | 1.8 | 4 |
| 2931 | 4/15/2019 | 4/17/2019 | 3 | 2 | 2 |
| 2931 | 4/17/2019 | 4/24/2019 | 3.3 | 1.8 | 7 |
+--------+------------+-----------+-----------+---------+---------------+
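To make the grouping rule concrete, here is a minimal sketch of the logic described above, separate from my current query further down. It uses a few inline rows taken from the example dataset instead of the real table, and it assumes that duplicates are exact copies and that, when two rows share a timestamp, the higher record index belongs to the older dataset. Both are assumptions on my part, not something the data guarantees.

```sql
-- Sketch only: inline rows stand in for the real table.
WITH samples AS (
    SELECT * FROM (VALUES
        (2931, TIMESTAMP '2019-04-13 22:04:00', 5, 1.9),
        (2931, TIMESTAMP '2019-04-15 22:04:00', 1, 3.0),
        (2931, TIMESTAMP '2019-04-16 22:04:00', 2, 2.5),
        (2931, TIMESTAMP '2019-04-17 22:04:00', 3, 2.0),
        (2931, TIMESTAMP '2019-04-17 22:04:00', 1, 3.3),
        (2931, TIMESTAMP '2019-04-17 22:04:00', 1, 3.3),  -- duplicate row
        (2931, TIMESTAMP '2019-04-18 22:04:00', 2, 3.0)
    ) AS t (sensor, samplerecorded, record, battery)
),
-- Assumption: duplicates are identical in every column, so DISTINCT removes them.
deduped AS (
    SELECT DISTINCT sensor, samplerecorded, record, battery
    FROM samples
),
-- Running count of "record = 1" rows per sensor: every restart opens a new run.
flagged AS (
    SELECT
        *,
        SUM(CASE WHEN record = 1 THEN 1 ELSE 0 END) OVER (
            PARTITION BY sensor
            ORDER BY samplerecorded, record DESC  -- tie-break is an assumption
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS run_id
    FROM deduped
)
SELECT
    sensor,
    run_id,
    MIN(samplerecorded) AS date_start,
    MAX(samplerecorded) AS date_end,
    MIN_BY(battery, samplerecorded) AS bat_start,  -- voltage at the earliest sample
    MAX_BY(battery, samplerecorded) AS bat_end,    -- voltage at the latest sample
    date_diff('day', MIN(samplerecorded), MAX(samplerecorded)) AS bat_life_days
FROM flagged
WHERE run_id > 0  -- rows before the first restart have no known start
GROUP BY sensor, run_id
ORDER BY sensor, run_id;
```

Note that the last run for each sensor may still be in progress, so its end values are not necessarily those of an empty battery.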
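The current implementation mostly works, but some noisy signals in the data produce bad results here and there. Right now I take the data from the start and end of each dataset. I still need to "merge" rows by serial number between a new battery (> 3000 mV) and an empty battery (< 2000 mV), because customers regularly pull the battery when the device is not working properly, which skews the results.
Any suggestions on how to accomplish this? And is there a better way to do what the query below does?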
WITH partitioned_data AS (
SELECT
cs.*,
ROW_NUMBER() OVER (
PARTITION BY serialnumber
ORDER BY
samplerecorded ASC
) AS rn
FROM
crawler_samples cs
WHERE
serialnumber LIKE '293%'
-- filter out noise in the data
AND battvoltage IS NOT NULL
AND battvoltage <> 0
AND battvoltage > 1000 -- millivolts in source
),
first_sample_table AS (
SELECT
rn AS start_rn,
serialnumber AS start_sn,
samplerecorded AS t_start,
battvoltage AS batt_start,
count(*) OVER (PARTITION BY serialnumber) AS reboots
FROM
partitioned_data
WHERE
record = 0
),
-- list of excluded devices
excluded_devices AS (
SELECT
DISTINCT start_sn AS sn
FROM
first_sample_table
WHERE
-- devices that sent illegal data at start of dataset
batt_start = 65535
),
-- remove excluded devices
filtered_fst AS (
SELECT
*
FROM
first_sample_table
LEFT JOIN excluded_devices ON first_sample_table.start_sn = excluded_devices.sn
WHERE
excluded_devices.sn IS NULL
),
--add the previous battery empty to same row
last_row_table AS (
SELECT
filtered_fst.*,
partitioned_data.rn,
LAG(partitioned_data.battVoltage, 2) OVER (
PARTITION BY partitioned_data.serialnumber
ORDER BY
partitioned_data.samplerecorded
) AS batt_end,
LAG(
partitioned_data.samplerecorded,
2
) OVER (
PARTITION BY partitioned_data.serialnumber
ORDER BY
partitioned_data.samplerecorded
) AS t_end
FROM
partitioned_data
LEFT JOIN filtered_fst ON partitioned_data.serialnumber = filtered_fst.start_sn
AND partitioned_data.rn = filtered_fst.start_rn
),
-- clean out join in next cte or LAG would be incorrect (WHERE executes first)
cleaned_last_row_table AS (
SELECT
*
FROM
last_row_table
WHERE
start_sn IS NOT NULL
),
--offset end columns to match start and stop
offset_end AS (
SELECT
start_sn AS serialnumber,
start_rn AS rownum,
t_start,
batt_start,
lead(t_end, 1) OVER (
PARTITION BY start_sn
ORDER BY
t_start
) AS t_end,
lead(batt_end, 1) OVER (
PARTITION BY start_sn
ORDER BY
t_start
) AS batt_end,
reboots
FROM
cleaned_last_row_table
),
-- append calculated column (easier to read)
FINAL AS (
SELECT
*,
date_diff('day', t_start, t_end) AS batt_time_days
FROM
offset_end
WHERE
t_start IS NOT NULL
AND t_end IS NOT NULL
)
SELECT
*
FROM
FINAL
ORDER BY
serialnumber,
rownum;
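For the merge problem described earlier, one direction I have been considering (not verified against the real data) is to treat a restart as a new battery only when the voltage is also above the new-battery threshold, so that a pull-and-reinsert of a half-used battery stays in the same group. The rows below are hypothetical and in millivolts; `battery_mv` and `battery_id` are names made up for this sketch.

```sql
-- Hypothetical rows: the second "record = 1" is a reinserted, half-used battery.
WITH samples AS (
    SELECT * FROM (VALUES
        (2930, TIMESTAMP '2019-04-20 22:04:00', 1, 3300),
        (2930, TIMESTAMP '2019-04-21 22:04:00', 2, 2800),
        (2930, TIMESTAMP '2019-04-22 22:04:00', 1, 2700),  -- restart, but not a new battery
        (2930, TIMESTAMP '2019-04-23 22:04:00', 2, 2400),
        (2930, TIMESTAMP '2019-04-24 22:04:00', 3, 1900),
        (2930, TIMESTAMP '2019-04-25 22:04:00', 1, 3200)   -- restart with a new battery
    ) AS t (sensor, samplerecorded, record, battery_mv)
),
-- A new battery_id starts only on a restart that also reads above 3000 mV.
flagged AS (
    SELECT
        *,
        SUM(CASE WHEN record = 1 AND battery_mv > 3000 THEN 1 ELSE 0 END) OVER (
            PARTITION BY sensor
            ORDER BY samplerecorded
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS battery_id
    FROM samples
)
SELECT
    sensor,
    battery_id,
    MIN(samplerecorded) AS date_start,
    MAX(samplerecorded) AS date_end,
    MIN_BY(battery_mv, samplerecorded) AS bat_start,
    MAX_BY(battery_mv, samplerecorded) AS bat_end,
    date_diff('day', MIN(samplerecorded), MAX(samplerecorded)) AS bat_life_days
FROM flagged
WHERE battery_id > 0
GROUP BY sensor, battery_id
-- Keep only batteries whose last reading is below the empty threshold.
HAVING MAX_BY(battery_mv, samplerecorded) < 2000
ORDER BY sensor, battery_id;
```

The 3000 mV and 2000 mV cut-offs are the ones mentioned above. Whether a fixed threshold is robust enough against the noise in the real data is exactly the part I am unsure about.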