SparkSQL: How to sum two time series datasets with different timestamps

Date: 2018-10-13 10:12:34

Tags: apache-spark apache-spark-sql apache-spark-dataset

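I have two time series datasets that I need to combine using some kind of windowing. The timestamps in the two datasets differ. The result should be the sum of the "value" fields from both datasets that fall within each window of the result dataset.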

Is there a built-in function in Spark that makes this easy? If not, what is the best way to achieve it?

DataSet-1 
 raw_data_field_id | date_time_epoch | value
-------------------+-----------------+-----------
                23 |   1528766100068 |       131
                23 |   1528765200058 | 130.60001
                23 |   1528764300049 |     130.3
                23 |   1528763400063 |       130
                23 |   1528762500059 | 129.60001
                23 |   1528761600050 |     129.3
                23 |   1528760700051 | 128.89999
                23 |   1528759800047 | 128.60001

DataSet-2
 raw_data_field_id | date_time_epoch | value
-------------------+-----------------+-----------
                24 |   1528766100000 |       41
                24 |   1528765200000 |       60
                24 |   1528764300000 |       30.03
                24 |   1528763400000 |       43
                24 |   1528762500000 |       34.01
                24 |   1528761600000 |       29.36
                24 |   1528760700000 |       48.99
                24 |   1528759800000 |       28.01

1 Answer:

Answer 0 (score: 1)

Here is an example:

scala> d1.show
+-----------------+--------------------+---------+
|raw_data_field_id|     date_time_epoch|    value|
+-----------------+--------------------+---------+
|               23|2018-06-12 01:15:...|    131.0|
|               23|2018-06-12 01:00:...|130.60001|
|               23|2018-06-12 00:45:...|    130.3|
|               23|2018-06-12 00:30:...|    130.0|
|               23|2018-06-12 00:15:...|129.60001|
|               23|2018-06-12 00:00:...|    129.3|
|               23|2018-06-11 23:45:...|128.89999|
|               23|2018-06-11 23:30:...|128.60001|
+-----------------+--------------------+---------+


scala> d2.show
+-----------------+--------------------+-----+
|raw_data_field_id|     date_time_epoch|value|
+-----------------+--------------------+-----+
|               24|2018-06-12 01:15:...| 41.0|
|               24|2018-06-12 01:00:...| 60.0|
|               24|2018-06-12 00:45:...|30.03|
|               24|2018-06-12 00:30:...| 43.0|
|               24|2018-06-12 00:15:...|34.01|
|               24|2018-06-12 00:00:...|29.36|
|               24|2018-06-11 23:45:...|48.99|
|               24|2018-06-11 23:30:...|28.01|
+-----------------+--------------------+-----+
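
The tables above show date_time_epoch as a timestamp, while the question's data stores epoch milliseconds; the answer does not show that conversion. A minimal sketch of one way it could be done, assuming a raw DataFrame (called raw1 here, a hypothetical name) whose date_time_epoch column is a Long in milliseconds, and relying on Spark's numeric-to-timestamp cast:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("epoch-to-ts").getOrCreate()
import spark.implicits._

// Hypothetical raw input: epoch milliseconds as Long (two rows from DataSet-1).
val raw1 = Seq(
  (23, 1528766100068L, 131.0),
  (23, 1528765200058L, 130.60001)
).toDF("raw_data_field_id", "date_time_epoch", "value")

// Casting a number to timestamp interprets it as seconds since the epoch,
// so the milliseconds are divided by 1000 first.
val d1 = raw1.withColumn(
  "date_time_epoch",
  (col("date_time_epoch") / 1000).cast("timestamp")
)

d1.printSchema()
```

The window function used later in the answer expects a timestamp column, which is why this step is needed before grouping.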

scala> d1.union(d2).show
+-----------------+--------------------+---------+
|raw_data_field_id|     date_time_epoch|    value|
+-----------------+--------------------+---------+
|               23|2018-06-12 01:15:...|    131.0|
|               23|2018-06-12 01:00:...|130.60001|
|               23|2018-06-12 00:45:...|    130.3|
|               23|2018-06-12 00:30:...|    130.0|
|               23|2018-06-12 00:15:...|129.60001|
|               23|2018-06-12 00:00:...|    129.3|
|               23|2018-06-11 23:45:...|128.89999|
|               23|2018-06-11 23:30:...|128.60001|
|               24|2018-06-12 01:15:...|     41.0|
|               24|2018-06-12 01:00:...|     60.0|
|               24|2018-06-12 00:45:...|    30.03|
|               24|2018-06-12 00:30:...|     43.0|
|               24|2018-06-12 00:15:...|    34.01|
|               24|2018-06-12 00:00:...|    29.36|
|               24|2018-06-11 23:45:...|    48.99|
|               24|2018-06-11 23:30:...|    28.01|
+-----------------+--------------------+---------+
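import org.apache.spark.sql.functions.{avg, window}
// Stack both datasets, then bucket the rows into 15-minute tumbling windows.
val df = d1.union(d2)
// This averages the values per window; replace avg with sum to get totals.
val avg_df = df.groupBy(window($"date_time_epoch", "15 minutes")).agg(avg($"value"))
avg_df.show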
+--------------------+-----------------+
|              window|       avg(value)|
+--------------------+-----------------+
|[2018-06-11 23:45...|        88.944995|
|[2018-06-12 00:30...|             86.5|
|[2018-06-12 01:15...|             86.0|
|[2018-06-11 23:30...|        78.305005|
|[2018-06-12 00:00...|79.33000000000001|
|[2018-06-12 00:45...|           80.165|
|[2018-06-12 00:15...|        81.805005|
|[2018-06-12 01:00...|        95.300005|
+--------------------+-----------------+
avg_df.sort("window.start").select("window.start","window.end","avg(value)").show(truncate = false)
+-------------------+-------------------+-----------------+
|start              |end                |avg(value)       |
+-------------------+-------------------+-----------------+
|2018-06-11 23:30:00|2018-06-11 23:45:00|78.305005        |
|2018-06-11 23:45:00|2018-06-12 00:00:00|88.944995        |
|2018-06-12 00:00:00|2018-06-12 00:15:00|79.33000000000001|
|2018-06-12 00:15:00|2018-06-12 00:30:00|81.805005        |
|2018-06-12 00:30:00|2018-06-12 00:45:00|86.5             |
|2018-06-12 00:45:00|2018-06-12 01:00:00|80.165           |
|2018-06-12 01:00:00|2018-06-12 01:15:00|95.300005        |
|2018-06-12 01:15:00|2018-06-12 01:30:00|86.0             |
+-------------------+-------------------+-----------------+
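
The example above averages the values in each window, whereas the question asks for their sum. A minimal sketch of the sum variant, assuming the same union-then-window approach and using only two rows per dataset from the question for brevity. The local SparkSession, the helper toTs, and the column name total_value are illustrative choices, not part of the original answer:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, sum, window}

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("window-sum").getOrCreate()
import spark.implicits._

// Hypothetical helper: build a DataFrame and convert epoch millis to a timestamp.
def toTs(rows: Seq[(Int, Long, Double)]): DataFrame =
  rows.toDF("raw_data_field_id", "date_time_epoch", "value")
    .withColumn("date_time_epoch", (col("date_time_epoch") / 1000).cast("timestamp"))

val d1 = toTs(Seq((23, 1528766100068L, 131.0), (23, 1528765200058L, 130.60001)))
val d2 = toTs(Seq((24, 1528766100000L, 41.0), (24, 1528765200000L, 60.0)))

// Union the two series, bucket into 15-minute tumbling windows,
// and sum every value that falls inside each window.
val sumDf = d1.union(d2)
  .groupBy(window(col("date_time_epoch"), "15 minutes"))
  .agg(sum(col("value")).as("total_value"))
  .select(col("window.start").as("start"), col("window.end").as("end"), col("total_value"))
  .orderBy("start")

sumDf.show(truncate = false)
```

Because both series are stacked before grouping, timestamps that differ by a few milliseconds still land in the same 15-minute window and are added together.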