Correlated scalar variables must be aggregated for scalar sub-queries in Spark

Time: 2019-04-02 09:12:09

Tags: sql apache-spark group-by apache-spark-sql

I have a Dataset<Row> with six columns, as shown below:

 +---------------+---------------+----------------+-------+--------------+--------+
 |  time         | thingId       |     controller | module| variableName |  value |
 +---------------+---------------+----------------+-------+--------------+--------+
 |1554188264901  |  0002019000000|        0       | 0     |Voltage       |    5   |
 |1554188264901  |  0002019000000|        0       | 0     |SetPoint      |    7   |
 |1554188276412  |  0002019000000|        0       | 0     |Voltage       |    9   |
 |1554188276412  |  0002019000000|        0       | 0     |SetPoint      |    10  |  
 |1554188639406  |  0002019000000|        0       | 0     |SetPoint      |    6   |
 +---------------+---------------+----------------+-------+--------------+--------+

Final goal:

To get the last updated row per variableName, for each combination of thingId, controller and module, with MAX(time).

So the required output should have MAX(time) in all rows, and the rest of the variableName values should carry their last updated value. The only row that actually carries MAX(time) is:

 +---------------+---------------+----------------+-------+--------------+--------+
 |  time         | thingId       |     controller | module| variableName |  value |
 +---------------+---------------+----------------+-------+--------------+--------+
 |1554188639406  |  0002019000000|        0       | 0     |SetPoint      |    6   |
 +---------------+---------------+----------------+-------+--------------+--------+

But for this particular thingId, controller and module, the column variableName has two values ('Voltage' and 'SetPoint'), so the output should also include the last updated row for the value 'Voltage' in variableName, stamped with MAX(time).

Expected output, as below:

 +---------------+---------------+----------------+-------+--------------+--------+
 |  time         | thingId       |     controller | module| variableName |  value |
 +---------------+---------------+----------------+-------+--------------+--------+
 |1554188639406  |  0002019000000|        0       | 0     |Voltage       |    9   |
 |1554188639406  |  0002019000000|        0       | 0     |SetPoint      |    6   |
 +---------------+---------------+----------------+-------+--------------+--------+

What I tried:

I tried a scalar sub-query to get this, but the column inside the sub-query has to be aggregated. I tried it in multiple ways, but with no luck.

For example, the code below:

 Dataset<Row> update = spark.sql(
     "SELECT MAX(p.time) max_time, p.thingId, p.controller, p.module, p.variableName, "
   + "(SELECT d.value FROM abc d "
   + " WHERE d.thingId = p.thingId AND d.controller = p.controller "
   + "   AND d.module = p.module AND d.variableName = p.variableName) AS lastUpdatedValue "
   + "FROM abc p GROUP BY thingId, controller, module, variableName");

which throws the error:

  Correlated scalar variables must be aggregated for scalar sub-queries
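
One variation that should satisfy the analyzer is wrapping the sub-query's value in an aggregate, e.g. MAX(d.value) (a sketch, for illustration), but that returns the largest value rather than the value from the MAX(time) row, so it does not give what I need:

 // sketch: the correlated sub-query is now aggregated, which should
 // satisfy the analyzer, but MAX(d.value) picks the wrong value
 Dataset<Row> update = spark.sql(
     "SELECT MAX(p.time) max_time, p.thingId, p.controller, p.module, p.variableName, "
   + "(SELECT MAX(d.value) FROM abc d "
   + " WHERE d.thingId = p.thingId AND d.controller = p.controller "
   + "   AND d.module = p.module AND d.variableName = p.variableName) AS maxValue "
   + "FROM abc p GROUP BY thingId, controller, module, variableName");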

How can I solve this? If there is any workaround, please suggest it.

Thanks!

2 Answers:

Answer 0 (score: 1)

The problem seems to be that you actually need both aggregation and ordering.

You need the value that corresponds directly to MAX(time) for the specific grouping values, including the variableName column; basically, the value from the same row. Since there is no aggregate function in SQL that does this, you can sort the sub-query result instead.

So, to get the desired "last updated" value, you can order by time descending and then limit the result to 1 row.

Maybe something like this:

 Dataset<Row> update = spark.sql(
     "SELECT MAX(p.time) max_time, p.thingId, p.controller, p.module, p.variableName, "
   + "(SELECT d.value FROM abc d "
   + " WHERE d.thingId = p.thingId AND d.controller = p.controller "
   + "   AND d.module = p.module AND d.variableName = p.variableName "
   + " ORDER BY time DESC LIMIT 1) AS lastUpdatedValue "
   + "FROM abc p GROUP BY thingId, controller, module, variableName");

P.S. I come from a SQL Server background, so normally I would do TOP 1. I'm not sure whether LIMIT 1 has the same effect in Apache Spark SQL.

EDIT: I found this, thanks to this answer here.

Basically, it talks about an aggregate function in Spark called first.

Maybe using it in the sub-query could solve the problem?

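A sketch of that idea (an assumption on my part, reusing the abc view from the question; note that Spark documents first as non-deterministic, so this relies on the sort order of the inner query being preserved through the aggregation):

 // sketch: sort descending by time, then take the first value per group
 Dataset<Row> update = spark.sql(
     "SELECT thingId, controller, module, variableName, "
   + "       MAX(time) AS time, first(value) AS lastUpdatedValue "
   + "FROM (SELECT * FROM abc ORDER BY time DESC) t "
   + "GROUP BY thingId, controller, module, variableName");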

Answer 1 (score: 1)

I ended up solving this using struct in the Spark Dataset API.

Input dataset:

 +---------------+---------------+----------------+-------+--------------+--------+
 |  time         | thingId       |     controller | module| variableName |  value |
 +---------------+---------------+----------------+-------+--------------+--------+
 |1554188264901  |  0002019000000|        0       | 0     |Voltage       |    5   |
 |1554188264901  |  0002019000000|        0       | 0     |SetPoint      |    7   |
 |1554188276412  |  0002019000000|        0       | 0     |Voltage       |    9   |
 |1554188276412  |  0002019000000|        0       | 0     |SetPoint      |    10  |  
 |1554188639406  |  0002019000000|        0       | 0     |SetPoint      |    6   |
 +---------------+---------------+----------------+-------+--------------+--------+

 import static org.apache.spark.sql.functions.*;  // max, struct

 // max over a struct compares its fields left to right, so per
 // (thingId, controller, module, variableName) group this keeps the
 // struct with the latest time, and the matching value rides along
 Dataset<Row> intermediate = inputDS
     .groupBy("thingId", "controller", "module", "variableName")
     .agg(max(struct("time", "value")).as("time_value_struct"))
     .select("thingId", "controller", "module", "variableName", "time_value_struct.*");

 // the code above gives this intermediate output:
 +---------------+---------------+----------------+-------+--------------+--------+
 |  time         | thingId       |     controller | module| variableName |  value |
 +---------------+---------------+----------------+-------+--------------+--------+
 |1554188276412  |  0002019000000|        0       | 0     |Voltage       |    9   |
 |1554188639406  |  0002019000000|        0       | 0     |SetPoint      |    6   |
 +---------------+---------------+----------------+-------+--------------+--------+

So now my task was to get the max of the time column and populate it for that thingId, controller and module, for which I used SQL as below:

 intermediate.createOrReplaceTempView("intermediate");

 // join each row back to the MAX(time) of its (thingId, controller, module)
 // group, so every output row carries that maximum time
 Dataset<Row> outputDS = spark.sql(
     "SELECT B.time, A.thingId, A.controller, A.module, A.variableName, A.value "
   + "FROM intermediate A "
   + "INNER JOIN (SELECT thingId, controller, module, MAX(time) time "
   + "            FROM intermediate GROUP BY thingId, controller, module) B "
   + "ON A.thingId = B.thingId AND A.controller = B.controller AND A.module = B.module");

which gives us the expected output:

 +---------------+---------------+----------------+-------+--------------+--------+
 |  time         | thingId       |     controller | module| variableName |  value |
 +---------------+---------------+----------------+-------+--------------+--------+
 |1554188639406  |  0002019000000|        0       | 0     |Voltage       |    9   |
 |1554188639406  |  0002019000000|        0       | 0     |SetPoint      |    6   |
 +---------------+---------------+----------------+-------+--------------+--------+

So now I can do a pivot to get the last updated value for each thingId, controller and module.
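
A possible pivot step on top of outputDS (just a sketch of what I mean, with the column names assumed as above; first here only collapses the single value per group):

 import static org.apache.spark.sql.functions.first;

 // sketch: one column per variableName, one row per thingId/controller/module/time
 Dataset<Row> pivoted = outputDS
     .groupBy("thingId", "controller", "module", "time")
     .pivot("variableName")
     .agg(first("value"));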

It would be even better if I could figure out an efficient SQL query instead of the inner join, since I know the final SQL still has a join inside.
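
One join-free possibility might be a window function over the intermediate view (an untested sketch), which stamps the per-group MAX(time) onto every row in a single pass:

 // sketch: window MAX(time) per (thingId, controller, module), no join
 Dataset<Row> outputDS = spark.sql(
     "SELECT MAX(time) OVER (PARTITION BY thingId, controller, module) AS time, "
   + "thingId, controller, module, variableName, value "
   + "FROM intermediate");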

Thanks @johwhite for the help.