Caching query results from a dplyr database backend

Date: 2017-03-02 21:21:22

Tags: sql r caching dplyr

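I'm using the dplyr database backend against an AWS Redshift database. Because some queries take forever to return, I'd like to cache them. I know the underlying data doesn't change, so if the query hasn't changed, the result set won't change either.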

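The approach I've taken elsewhere is:

  • Hash the query string
  • Save the query result to a {hash}.rds file
  • On the next run of the script, if the hash hasn't changed, read the result from disk; otherwise re-run the query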
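
For reference, the three steps above might look roughly like this in base R. This is only a sketch: `hash_string()`, `cached_query()`, the `run_query` callback and the `cache` directory are hypothetical names made up for illustration, and the hash is computed by writing the string to a temporary file because `tools::md5sum()` operates on files.

```r
# Hypothetical helper: MD5 of a string using only base R.
# tools::md5sum() hashes files, so the string is written to a temp file first.
hash_string <- function(x) {
  tmp <- tempfile()
  on.exit(unlink(tmp))
  writeLines(x, tmp, useBytes = TRUE)
  unname(tools::md5sum(tmp))
}

# Hypothetical helper: return the cached result for `query` if one exists,
# otherwise call `run_query(query)` and save the result as {hash}.rds.
cached_query <- function(query, run_query, cache_dir = "cache") {
  dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)
  path <- file.path(cache_dir, paste0(hash_string(query), ".rds"))
  if (file.exists(path)) {
    return(readRDS(path))  # hash unchanged: read from disk
  }
  result <- run_query(query)  # hash is new: hit the database
  saveRDS(result, path)
  result
}
```

Here `run_query` stands in for whatever actually executes the SQL (for example a wrapper around a DBI connection); it is an assumption, not part of the original setup.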

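I've been trying the same approach with dplyr. Unfortunately, the SQL query string that dplyr generates changes between runs even when the operations stay the same: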

df %>%
  select(week, person_id) %>%
  group_by(person_id) %>%
  mutate(weeks_active = n()) %>%
  arrange(weeks_active) %>% 
  dplyr::sql_render()

produces

<SQL> SELECT *
FROM (SELECT "week", "person_id", COUNT(*) OVER (PARTITION BY "person_id") AS "weeks_active"
FROM (SELECT "week" AS "week", "person_id" AS "person_id"
FROM "fct_person_week") "zznunjjdwe") "ltyyfmiahu"
ORDER BY "weeks_active"

on the first run, and

<SQL> SELECT *
FROM (SELECT "week", "person_id", COUNT(*) OVER (PARTITION BY "person_id") AS "weeks_active"
FROM (SELECT "week" AS "week", "person_id" AS "person_id"
FROM "fct_person_week") "stxupavckd") "oaknuxjexc"
ORDER BY "weeks_active"

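on the second. Is there a way to keep the table aliases fixed? Is there some other digest of the query that stays the same across runs? Or should I be looking at a different way of caching?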
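
One possible workaround, sketched here under assumptions rather than taken from dplyr itself: since the two renderings above differ only in the random subquery aliases, those aliases could be replaced with a fixed placeholder before hashing. `normalize_aliases()` is a hypothetical helper, and the pattern assumes that every generated alias is a double-quoted run of exactly ten lowercase letters directly after a closing parenthesis, as in the output shown. A real identifier of that shape in the same position would be rewritten too, so the normalized string should only be used as a cache key, not executed.

```r
# Hypothetical helper: replace dplyr's random subquery aliases with a fixed
# placeholder so that the rendered SQL hashes identically across runs.
# Assumption: an alias is `) "<ten lowercase letters>"`, as in the output above.
normalize_aliases <- function(sql) {
  gsub('\\) "[a-z]{10}"', ') "alias"', as.character(sql))
}

# The tail of the two renderings from the question:
run1 <- 'FROM "fct_person_week") "zznunjjdwe") "ltyyfmiahu"'
run2 <- 'FROM "fct_person_week") "stxupavckd") "oaknuxjexc"'

# Both normalize to the same string, so a hash of the result is stable.
identical(normalize_aliases(run1), normalize_aliases(run2))
```

The output of `dplyr::sql_render()` can be passed in directly, since `as.character()` drops the `sql` class before the substitution.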

1 Answer:

Answer 0 (score: 0)

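You can use compute() to create a temporary table. Another option is to take the generated SQL and turn it into a view, so R developers can simply refer to it by table name.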
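
A rough sketch of the view option, assuming a DBI connection `con` and a lazy dplyr query `query_tbl` (both hypothetical names, as is the view name `person_weeks_active`). `create_view_sql()` is an illustrative helper that only builds the statement text; the quoting is naive, and `DBI::dbQuoteIdentifier()` would be the safer choice for real identifiers.

```r
# Hypothetical helper: wrap a rendered SELECT statement in CREATE VIEW.
create_view_sql <- function(view_name, select_sql) {
  paste0('CREATE VIEW "', view_name, '" AS\n', as.character(select_sql))
}

# Possible usage, assuming `con` and `query_tbl` exist (not run here):
#   stmt <- create_view_sql("person_weeks_active", dplyr::sql_render(query_tbl))
#   DBI::dbExecute(con, stmt)
#   dplyr::tbl(con, "person_weeks_active")
#
# The compute() option materializes the result as a temporary table instead:
#   cached_tbl <- dplyr::compute(query_tbl)
```

Note that a temporary table created by compute() normally lasts only for the database session, and a plain view re-runs its query on every read, so neither replaces an on-disk cache across script runs on its own.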