Does SparkSQL now allow reusing the results of select expressions in GROUP BY and ORDER BY?

Time: 2018-02-28 23:35:24

Tags: apache-spark hive apache-spark-sql

About six months ago (September 2017), I posted a question asking whether SparkSQL allows reusing the results of select expressions in GROUP BY and ORDER BY. The answer was no; you were forced into workarounds such as wrapping the query in a subquery or repeating the expression in the GROUP BY (see the earlier post here).
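For context, the subquery workaround I am referring to looks roughly like this. It is only a sketch, reusing the same someTable / starttime columns from the examples below: the alias is computed in a derived table, so the outer GROUP BY and ORDER BY can always resolve it as a real column.

select 
  count(*) cnt,
  timeDate
from (
  select date(from_unixtime(t.starttime/1000)) as timeDate
  from someTable as t
) x
group by 
  timeDate
order by 
  timeDate desc
/* timeDate is a column of the derived table, so it resolves regardless of alias-reuse support */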

But I just tried the query again. Perhaps we have since upgraded to a newer version of SparkSQL. I noticed that SparkSQL now does allow the reuse.

Updated with a new query example and a new observation

Here is an example. This query (the new way) is now allowed:

select 
  count(*) cnt,
  date(from_unixtime(t.starttime/1000)) as timeDate  
from someTable as t
group by 
  timeDate
order by 
  timeDate desc
/* Okay */

Can we still use the old way? Based on my experiments, the answer is yes and no. I don't know whether this is a bug or intentional.

This query is allowed:

select 
  count(*) cnt,
  date(from_unixtime(t.starttime/1000)) as timeDate  
from someTable as t
group by 
  date(from_unixtime(t.starttime/1000))
order by 
  date(from_unixtime(t.starttime/1000)) desc
/* Okay */

But the following query triggers an exception:

select 
  count(*) cnt,
  cast(date(from_unixtime(t.starttime/1000)) as date) as timeDate  
from someTable as t
group by 
  date(from_unixtime(t.starttime/1000))
order by 
  starttime desc  
/* Error: AnalysisException */

The error:

Error: org.apache.spark.sql.AnalysisException: \
cannot resolve '`starttime`' given input columns: \
[cnt, timeDate]; line 7 pos 9;

See more failing/succeeding cases below.

I understand that the cast(... as date) in the example above is redundant; here I just wanted to explore what the syntax allows. If we use the new way (reusing the result of the select expression), there is no problem:

select 
  count(*) cnt,
  cast(date(from_unixtime(t.starttime/1000)) as date) as timeDate  
from someTable as t
group by 
  timeDate
order by 
  timeDate desc
/* Okay */

I like this new way, but I want to make sure I have observed it correctly. Can anyone confirm this?

I forget which versions we were running six months ago, but here are the current versions:

Our software versions:

  • HDFS: 2.7.3
  • Hive: 1.2.1
  • HBase: 1.1.2
  • Spark: 2.0.2
  • YARN: 2.7.3
  • Spark Thrift Server: 2.2.0
  • SQuirreL SQL Client: 3.8.0

Appendix: more failing/succeeding cases

It seems that the ORDER BY must match the GROUP BY in some way, or an exception is thrown.

select 
  count(*) cnt,
  date(from_unixtime(t.starttime/1000)) as timeDate  
from someTable as t
group by 
  date(from_unixtime(t.starttime/1000))
order by 
  t.starttime desc 
/* Error: AnalysisException */  

select 
  count(*) cnt,
  cast(date(from_unixtime(t.starttime/1000)) as date) as timeDate  
from someTable as t
group by 
  cast(date(from_unixtime(t.starttime/1000)) as date)
order by 
  cast(date(from_unixtime(t.starttime/1000)) as date) desc   
/* Okay */      
