Question

We cache tables in the Spark Thrift Server. A new cache is created for each week of data:

cache table event_2018_12_1 as select * from ... where ...;
cache table event_2018_12_2 as select * from ... where ...;

Because the source data (Cassandra) is updated continuously, the caches have to be refreshed:

refresh table event_2018_12_1;
select count(*) from event_2018_12_1;

refresh table event_2018_12_2;
select count(*) from event_2018_12_2;

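The refresh is lazy, so I run a count(*) to trigger it.

The problem is that other clients select from the cached tables while they are being refreshed. Those selects hang until the cache has been fully rebuilt, which takes several minutes.

I would like to refresh the cache asynchronously and expose the fresh data only once loading has finished (similar to the behavior of Guava's LoadingCache).

How can this be achieved in Spark?

Possible workaround:

Create a view as a layer on top of the cached tables: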

create or replace TEMPORARY VIEW event_2018_10 as 
select * from event_2018_12_1
union all select * from event_2018_12_2;

Instead of refreshing, create a new cache and replace the view:

cache table event_2018_12_1_c1 as select * from ... where ...;

create or replace TEMPORARY VIEW event_2018_10 as 
select * from event_2018_12_1_c1
union all select * from event_2018_12_2;

--after some delay drop the first cache
drop table event_2018_12_1;
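
For illustration, the swap sequence above could be generated by a small helper like the one below. This is only a sketch: `swap_statements` and its parameters are hypothetical names, it only builds the SQL strings in the order shown above and does not talk to Spark, and the source query is passed through untouched.

```python
def swap_statements(view, tables, stale, fresh, source_query):
    """Build the SQL statements, in order, that replace the cached table
    `stale` with a newly cached table `fresh` behind the view `view`.

    All names are supplied by the caller; nothing is executed here.
    """
    if stale not in tables:
        raise ValueError(f"{stale} is not one of the tables behind {view}")

    # Same table list as before, with the stale cache swapped for the fresh one.
    new_tables = [fresh if t == stale else t for t in tables]
    union = "\nunion all ".join(f"select * from {t}" for t in new_tables)

    return [
        # 1. Build the new cache under a new name.
        f"cache table {fresh} as {source_query}",
        # 2. Repoint the view at the new cache.
        f"create or replace temporary view {view} as\n{union}",
        # 3. After some delay, drop the old cache.
        f"drop table {stale}",
    ]


if __name__ == "__main__":
    for stmt in swap_statements(
        view="event_2018_10",
        tables=["event_2018_12_1", "event_2018_12_2"],
        stale="event_2018_12_1",
        fresh="event_2018_12_1_c1",
        source_query="select * from ... where ...",  # placeholder, as in the question
    ):
        print(stmt + ";")
```

The statements would still have to be submitted one at a time, with the delay before the final drop chosen by the caller.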

I am worried about changing the view's DDL while selects are running against it. Can this cause problems?

Spark: refreshing cached tables asynchronously
