This is the SQL query I am trying to execute:
select *,count(dummy) over(partition by dummy) as total_count
from aaca711a5e78441cdbf062f1d630ee261
WHERE (max_timestamp BETWEEN '2017-01-01' AND '2018-01-01')
ORDER BY max_timestamp DESC
As far as I know, both endpoint values are inclusive in a BETWEEN ... AND ... operation. Yet here, this query fails to fetch the records corresponding to 2018-01-01.
So I changed the query to:
select *,count(dummy) over(partition by dummy) as total_count
from aaca711a5e78441cdbf062f1d630ee261
WHERE (max_timestamp >= '2017-01-01' AND max_timestamp <= '2018-01-01')
ORDER BY max_timestamp DESC
However, it still didn't work. Then I tried this:
select *,count(dummy) over(partition by dummy) as total_count
from aaca711a5e78441cdbf062f1d630ee261
WHERE (max_timestamp >= '2017-01-01' AND max_timestamp <= '2018-01-02')
ORDER BY max_timestamp DESC
This one does fetch the records corresponding to 2018-01-01.
What could be the reason for this, and how can I fix it? Thanks in advance.
Answer 0 (score: 2)
This is your query:
select *, count(dummy) over (partition by dummy) as total_count
from aaca711a5e78441cdbf062f1d630ee261
where max_timestamp BETWEEN '2017-01-01' AND '2018-01-01'
order by max_timestamp DESC;
Don't use between with date/time values at all. Use explicit logic:
select *, count(dummy) over (partition by dummy) as total_count
from aaca711a5e78441cdbf062f1d630ee261
where max_timestamp >= '2017-01-01' and
max_timestamp < '2018-01-02' --> notice this is one day later
order by max_timestamp DESC;
The problem is that your dates have a time component.
Aaron Bertrand explains this in his blog post What do BETWEEN and the devil have in common? (The title amuses me, given that BETWEEN certainly does exist, whereas the existence of the devil is rather more controversial.)
Answer 1 (score: 0)
This is a known issue in Spark.
See this link for details: https://issues.apache.org/jira/browse/SPARK-10837
I solved it using the date_add function that Spark provides: the end date is changed to date_add(endDate, 1), so that we fetch all values, including those corresponding to the last date. A sketch of this is shown below.
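For illustration, a minimal sketch of the fixed query under that approach (not the answerer's exact code; '2018-01-01' stands in for the endDate parameter, and a strict < is used so rows at exactly midnight of the following day are excluded; Spark's date_add returns a date, which Spark will cast for the comparison):

select *, count(dummy) over (partition by dummy) as total_count
from aaca711a5e78441cdbf062f1d630ee261
where max_timestamp >= '2017-01-01' and
      max_timestamp < date_add('2018-01-01', 1)  -- i.e. date_add(endDate, 1)
order by max_timestamp DESC;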