I have a Dataset<Row> with six columns, as shown below:
+---------------+---------------+----------------+-------+--------------+--------+
| time | thingId | controller | module| variableName | value |
+---------------+---------------+----------------+-------+--------------+--------+
|1554188264901 | 0002019000000| 0 | 0 |Voltage | 5 |
|1554188264901 | 0002019000000| 0 | 0 |SetPoint | 7 |
|1554188276412 | 0002019000000| 0 | 0 |Voltage | 9 |
|1554188276412 | 0002019000000| 0 | 0 |SetPoint | 10 |
|1554188639406 | 0002019000000| 0 | 0 |SetPoint | 6 |
+---------------+---------------+----------------+-------+--------------+--------+
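End goal:
Get the last updated row based on variableName for the combination of MAX(time), thingId, controller and module.
So the required output should have MAX(time) in all rows, and the remaining variableName values should carry their last updated value (last_updatedValue).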
+---------------+---------------+----------------+-------+--------------+--------+
| time | thingId | controller | module| variableName | value |
+---------------+---------------+----------------+-------+--------------+--------+
|1554188639406 | 0002019000000| 0 | 0 |SetPoint | 6 |
+---------------+---------------+----------------+-------+--------------+--------+
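For this particular thingId, controller and module, the column variableName has two values ('Voltage' and 'SetPoint'). So for the value Voltage in the column variableName, it should return the last updated row for Voltage, with its time replaced by MAX(time).
That is, as shown below. Expected output:
+---------------+---------------+----------------+-------+--------------+--------+
| time | thingId | controller | module| variableName | value |
+---------------+---------------+----------------+-------+--------------+--------+
|1554188276412 | 0002019000000| 0 | 0 |Voltage | 9 |
|1554188639406 | 0002019000000| 0 | 0 |SetPoint | 6 |
+---------------+---------------+----------------+-------+--------------+--------+
What I tried:
I tried a scalar sub-query to get this, but the columns in the sub-query have to be aggregated. I tried several variations, with no luck.
For example, one such query throws the error:
Correlated scalar variable must be aggregated for a scalar sub-query
How can I solve this? If there is any workaround, please suggest it.
Thanks!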
Answer 0 (score: 1)
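The problem seems to be that you actually need both aggregation and ordering.
You need the value that is directly tied to MAX(time) for the specific grouping value of the variableName column, so basically the value from the same row. Since there is no aggregate function in SQL that does this, you could order the sub-query results instead.
So, to get the "last updated" row you want, you could order by time descending and then limit the result to 1 row.
Maybe something like this:
Dataset<Row> update = spark.sql(
    "SELECT MAX(p.time) AS max_time, "
  + "p.thingId, p.controller, p.module, p.variableName, "
  + "(SELECT d.value FROM abc d "
  + "  WHERE d.thingId = p.thingId AND d.controller = p.controller "
  + "    AND d.module = p.module AND d.variableName = p.variableName "
  + "  ORDER BY d.time DESC LIMIT 1) AS lastUpdatedValue "
  + "FROM abc p "
  + "GROUP BY p.thingId, p.controller, p.module, p.variableName");
P.S. I come from a SQL Server background, so I would usually use TOP 1. I'm not sure whether LIMIT 1 has the same effect in Apache Spark SQL.
There is also an aggregate function in Spark called first.
Maybe using it in the sub-query would solve the problem?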
Answer 1 (score: 1)
I ended up solving this by using struct on the Spark Dataset.
Input dataset:
+---------------+---------------+----------------+-------+--------------+--------+
| time | thingId | controller | module| variableName | value |
+---------------+---------------+----------------+-------+--------------+--------+
|1554188264901 | 0002019000000| 0 | 0 |Voltage | 5 |
|1554188264901 | 0002019000000| 0 | 0 |SetPoint | 7 |
|1554188276412 | 0002019000000| 0 | 0 |Voltage | 9 |
|1554188276412 | 0002019000000| 0 | 0 |SetPoint | 10 |
|1554188639406 | 0002019000000| 0 | 0 |SetPoint | 6 |
+---------------+---------------+----------------+-------+--------------+--------+
// Needs static imports of max and struct from org.apache.spark.sql.functions
Dataset<Row> intermediate = inputDS
    .groupBy("thingId", "controller", "module", "variableName")
    .agg(max(struct("time", "value")).as("time_value_struct"))
    .select("thingId", "controller", "module", "variableName", "time_value_struct.*");
// The code above gives me this intermediate output:
+---------------+---------------+----------------+-------+--------------+--------+
| time | thingId | controller | module| variableName | value |
+---------------+---------------+----------------+-------+--------------+--------+
|1554188276412 | 0002019000000| 0 | 0 |Voltage | 9 |
|1554188639406 | 0002019000000| 0 | 0 |SetPoint | 6 |
+---------------+---------------+----------------+-------+--------------+--------+
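Why taking max over struct("time", "value") keeps the latest value: as far as I understand it, structs are ordered field by field, so the struct with the largest time wins and its value comes along with it. Below is a minimal plain-Java sketch of that ordering, without Spark. The class names StructMaxSketch, TimeValue and Reading and the method latestPerVariable are made up for illustration and are not part of any Spark API.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class StructMaxSketch {

    /** Hypothetical stand-in for struct(time, value); not a Spark type. */
    static final class TimeValue {
        final long time;
        final int value;

        TimeValue(long time, int value) {
            this.time = time;
            this.value = value;
        }
    }

    /** Hypothetical input row, reduced to the columns that matter here. */
    static final class Reading {
        final String variableName;
        final TimeValue tv;

        Reading(String variableName, long time, int value) {
            this.variableName = variableName;
            this.tv = new TimeValue(time, value);
        }
    }

    /** Field-by-field ordering: time first, value only as a tie-breaker. */
    static final Comparator<TimeValue> STRUCT_ORDER =
            Comparator.<TimeValue>comparingLong(tv -> tv.time)
                      .thenComparingInt(tv -> tv.value);

    /** Keeps, per variableName, the (time, value) pair that is largest under STRUCT_ORDER. */
    static Map<String, TimeValue> latestPerVariable(List<Reading> rows) {
        return rows.stream().collect(Collectors.toMap(
                r -> r.variableName,
                r -> r.tv,
                (a, b) -> STRUCT_ORDER.compare(a, b) >= 0 ? a : b,
                TreeMap::new));
    }

    public static void main(String[] args) {
        // Same five rows as the input dataset (thingId, controller, module omitted).
        List<Reading> rows = Arrays.asList(
                new Reading("Voltage", 1554188264901L, 5),
                new Reading("SetPoint", 1554188264901L, 7),
                new Reading("Voltage", 1554188276412L, 9),
                new Reading("SetPoint", 1554188276412L, 10),
                new Reading("SetPoint", 1554188639406L, 6));

        latestPerVariable(rows).forEach((name, tv) ->
                System.out.println(name + " -> time=" + tv.time + ", value=" + tv.value));
    }
}
```

Note that the field order matters: with struct("value", "time") the same max would pick the largest value rather than the latest one.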
So now my task is to get the max value of the time column and fill it in for that thingId, controller and module. For this I used SQL as follows:
intermediate.createOrReplaceTempView("intermediate");
Dataset<Row> outputDS = spark.sql(
    "select B.time, A.thingId, A.controller, A.module, A.variableName, A.value "
  + "from intermediate A "
  + "inner join (select thingId, controller, module, MAX(time) as time "
  + "  from intermediate group by thingId, controller, module) B "
  + "on A.thingId = B.thingId and A.controller = B.controller and A.module = B.module");
This gives us the expected output:
+---------------+---------------+----------------+-------+--------------+--------+
| time | thingId | controller | module| variableName | value |
+---------------+---------------+----------------+-------+--------------+--------+
|1554188639406 | 0002019000000| 0 | 0 |Voltage | 9 |
|1554188639406 | 0002019000000| 0 | 0 |SetPoint | 6 |
+---------------+---------------+----------------+-------+--------------+--------+
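So now I can do a pivot operation to get the latest updated value for each thingId, controller and module.
I know the intermediate SQL has a join in it; it would be great if I could figure out a more efficient SQL query instead of the inner join.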
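On avoiding the join: the end result only needs two things per (thingId, controller, module) group, the overall max time and the latest value per variableName, and both can be collected in one pass. In Spark SQL this would usually be written with a window aggregate such as MAX(time) OVER (PARTITION BY thingId, controller, module) on the intermediate view, which I have not run against this data. The plain-Java sketch below only illustrates the logic; LatestPerGroupSketch, Reading and latestWithGroupMaxTime are hypothetical names, and ties on time are resolved by input order, which is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LatestPerGroupSketch {

    /** Hypothetical row type mirroring the six columns of the input table. */
    static final class Reading {
        final long time;
        final String thingId;
        final int controller;
        final int module;
        final String variableName;
        final int value;

        Reading(long time, String thingId, int controller, int module,
                String variableName, int value) {
            this.time = time;
            this.thingId = thingId;
            this.controller = controller;
            this.module = module;
            this.variableName = variableName;
            this.value = value;
        }

        /** Key for the (thingId, controller, module) combination. */
        String groupKey() {
            return thingId + "|" + controller + "|" + module;
        }

        /** Copy of this row with the time column replaced. */
        Reading withTime(long newTime) {
            return new Reading(newTime, thingId, controller, module, variableName, value);
        }

        @Override
        public String toString() {
            return time + " " + thingId + " " + controller + " " + module
                    + " " + variableName + " " + value;
        }
    }

    /** One pass: track max time per group and the latest row per variableName. */
    static List<Reading> latestWithGroupMaxTime(List<Reading> rows) {
        Map<String, Long> maxTimePerGroup = new HashMap<>();
        Map<String, Reading> latestPerVariable = new LinkedHashMap<>();
        for (Reading r : rows) {
            maxTimePerGroup.merge(r.groupKey(), r.time, Long::max);
            // Assumption: on equal times the later input row wins.
            latestPerVariable.merge(r.groupKey() + "|" + r.variableName, r,
                    (old, cur) -> cur.time >= old.time ? cur : old);
        }
        List<Reading> out = new ArrayList<>();
        for (Reading r : latestPerVariable.values()) {
            // Stamp every surviving row with its group's max time.
            out.add(r.withTime(maxTimePerGroup.get(r.groupKey())));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Reading> rows = Arrays.asList(
                new Reading(1554188264901L, "0002019000000", 0, 0, "Voltage", 5),
                new Reading(1554188264901L, "0002019000000", 0, 0, "SetPoint", 7),
                new Reading(1554188276412L, "0002019000000", 0, 0, "Voltage", 9),
                new Reading(1554188276412L, "0002019000000", 0, 0, "SetPoint", 10),
                new Reading(1554188639406L, "0002019000000", 0, 0, "SetPoint", 6));
        latestWithGroupMaxTime(rows).forEach(System.out::println);
    }
}
```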
Thanks to @johwhite for the help.