In MySQL, I can run a query like this:
select
cast(from_unixtime(t.time, '%Y-%m-%d %H:00') as datetime) as timeHour
, ...
from
some_table t
group by
timeHour, ...
order by
timeHour, ...
where the timeHour in the GROUP BY is the result of a select expression.
But I just tried a similar query with Spark SQL and got the error:
Error: org.apache.spark.sql.AnalysisException:
cannot resolve '`timeHour`' given input columns: ...
My query for Spark SQL was:
select
cast(t.unixTime as timestamp) as timeHour
, ...
from
another_table as t
group by
timeHour, ...
order by
timeHour, ...
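For reference, here is roughly how I am submitting the query from Scala (a minimal sketch; I am using the spark-shell SparkSession, and the table has been registered as a temporary view beforehand):

// spark is an existing SparkSession (e.g. from spark-shell); another_table is a temp view
spark.sql("""
  select cast(t.unixTime as timestamp) as timeHour
  from another_table as t
  group by timeHour
  order by timeHour
""").show()
// fails with: org.apache.spark.sql.AnalysisException: cannot resolve '`timeHour`' ...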
Is this construct possible in Spark SQL?
Answer 0 (score: 4)
Is this construct possible in Spark SQL?

Yes, it is. You can make it work in Spark SQL in two ways, so that the new column is available to the GROUP BY and ORDER BY clauses.

Approach 1: using a sub-query:
SELECT timeHour, sum(...) AS someThing
FROM (SELECT
          from_unixtime(starttime/1000) AS timeHour
        , starttime
        , ...
      FROM
          some_table) AS t
WHERE
    starttime >= 1000*unix_timestamp('2017-09-16 00:00:00')
    AND starttime <= 1000*unix_timestamp('2017-09-16 04:00:00')
GROUP BY
    timeHour
ORDER BY
    timeHour
LIMIT 10;
Approach 2: using WITH (a more elegant way):
-- create an alias (named sub-query) that computes the derived column
WITH table_aliase AS (SELECT
          from_unixtime(starttime/1000) AS timeHour
        , starttime
        , ...
      FROM
          some_table)

-- then use the alias like a table
SELECT timeHour, sum(...) AS someThing
FROM table_aliase
WHERE
    starttime >= 1000*unix_timestamp('2017-09-16 00:00:00')
    AND starttime <= 1000*unix_timestamp('2017-09-16 04:00:00')
GROUP BY
    timeHour
ORDER BY
    timeHour
LIMIT 10;
An alternative using the Spark DataFrame API (without SQL) from Scala:
// Note: the $"col" syntax below also needs the SparkSession implicits in scope,
// e.g. import spark.implicits._
val df = ....   // load the actual table as a DataFrame
import org.apache.spark.sql.functions._

df.withColumn("timeHour", from_unixtime($"starttime" / 1000))
  .groupBy($"timeHour")
  .agg(sum("...").as("someThing"))
  .orderBy($"timeHour")
  .show()
// another way, as per eliasah's comment
df.groupBy(from_unixtime($"starttime" / 1000).as("timeHour"))
  .agg(sum("...").as("someThing"))
  .orderBy($"timeHour")
  .show()
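For completeness, here is a self-contained sketch with toy data (the column names, sample values, and hour format are made up for illustration, not taken from the original table):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("groupByDerivedColumn").master("local[*]").getOrCreate()
import spark.implicits._

// toy data: (starttime in milliseconds, a value to aggregate)
val df = Seq(
  (1505520000000L, 1.0),
  (1505520060000L, 2.0),
  (1505523600000L, 3.0)
).toDF("starttime", "value")

// derive the hour bucket, then group and order by the new column
df.withColumn("timeHour", from_unixtime($"starttime" / 1000, "yyyy-MM-dd HH:00"))
  .groupBy($"timeHour")
  .agg(sum($"value").as("someThing"))
  .orderBy($"timeHour")
  .show()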
Answer 1 (score: 1)
I'd like to provide my own answer here...

As I see it, we have to rewrite the query and repeat the computation of the select expression in the GROUP BY (and ORDER BY) clause. For example:
select
from_unixtime((t.starttime/1000)) as timeHour
, sum(...) as someThing
from
some_table as t
where
t.starttime>=1000*unix_timestamp('2017-09-16 00:00:00')
and t.starttime<=1000*unix_timestamp('2017-09-16 04:00:00')
group by
from_unixtime((t.starttime/1000))
order by
from_unixtime((t.starttime/1000))
limit 10;
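If you are driving this from Scala rather than a SQL shell, the same statement can be passed to spark.sql after registering the table as a temporary view (a sketch; df, spark, and the count(*) aggregate are stand-ins for the real data and aggregation):

// assumes df holds the data behind some_table and spark is an existing SparkSession
df.createOrReplaceTempView("some_table")
spark.sql("""
  select
    from_unixtime(t.starttime/1000) as timeHour
    , count(*) as someThing
  from some_table as t
  where t.starttime >= 1000*unix_timestamp('2017-09-16 00:00:00')
    and t.starttime <= 1000*unix_timestamp('2017-09-16 04:00:00')
  group by from_unixtime(t.starttime/1000)
  order by from_unixtime(t.starttime/1000)
  limit 10
""").show()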