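Calling dplyr::arrange() on a table in a remote source
adds an "Ordered by: ..." flag. Is there a follow-up function that removes the "Ordered by:" flag from a remote table?
Consider the example data: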
tmp_cars_sdf <-
  copy_to(con_psql, cars, name = "tmp_cars_sdf", overwrite = TRUE)
For this:
glimpse(tmp_cars_sdf)
# Observations: ??
# Variables: 2
# Database: postgres 9.5.3
# $ speed <dbl> 4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13, 13...
# $ dist <dbl> 2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28, 26...
Consider:
tmp_cars <-
  cars
tmp_cars <-
  tmp_cars %>%
  arrange(speed, dist)
glimpse(tmp_cars)
# Observations: 50
# Variables: 2
# $ speed <dbl> 4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13, 13, 13, 13,...
# $ dist <dbl> 2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28, 26, 34, 34,...
But:
tmp_cars <-
  tmp_cars_sdf %>%
  arrange(speed, dist)
glimpse(tmp_cars)
# Observations: ??
# Variables: 2
# Database: postgres 9.5.3
# Ordered by: speed, dist
# $ speed <dbl> 4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13, 13, 13, 13,...
# $ dist <dbl> 2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28, 26, 34, 34,...
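Answer 0 (score: 1)
dbplyr tends to nest subqueries as commands are added, so once you add further commands, an earlier arrange() is likely to end up inside a subquery. This appears to be the underlying problem.
One way to remove these clauses is to render the underlying SQL query and edit it directly. Perhaps something like the following: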
unarrange = function(table, cols_prev_ordered_by){
  db_connection = table$src$con
  # Rebuild the clause as it is rendered, e.g. ORDER BY "speed", "dist"
  order_text = paste0("ORDER BY \"",
                      paste0(cols_prev_ordered_by, collapse = "\", \""),
                      "\"")
  query_text = table %>% sql_render() %>% as.character()
  # fixed = TRUE matches the clause literally rather than as a regular expression
  new_query_text = gsub(order_text, "", query_text, fixed = TRUE)
  # Wrap the edited text in sql() so it is passed through as SQL, not quoted as a string
  return(tbl(db_connection, sql(new_query_text)))
}
# example:
tmp_cars <-
  tmp_cars_sdf %>%
  arrange(speed, dist) %>%
  unarrange(c("speed", "dist"))
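There are surely more robust ways than gsub() to identify and remove the ORDER BY part of the query. If this matters, you may want to look at ?select_query, since it has an explicit order_by argument.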
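As one possible step up from the literal gsub() match above, the sketch below removes a trailing ORDER BY clause with a regular expression, so the column names do not have to be passed in. The helper name strip_trailing_order_by is hypothetical and not part of dbplyr. It assumes the ORDER BY clause is the last clause of the query (no LIMIT after it) and sorts by plain columns rather than function calls.

```r
# Hypothetical helper (not part of dbplyr): drop a trailing ORDER BY clause
# from rendered SQL text. Because the match may not contain parentheses,
# an ORDER BY inside a subquery or a window function is left untouched.
strip_trailing_order_by <- function(query_text) {
  sub("\\s+ORDER\\s+BY\\s+[^()]*$", "", query_text,
      ignore.case = TRUE, perl = TRUE)
}

# Example on a hand-written query string (no database connection needed)
query <- 'SELECT *\nFROM "tmp_cars_sdf"\nORDER BY "speed", "dist"'
cat(strip_trailing_order_by(query), "\n")
```

Inside unarrange(), the result could be used in place of the gsub() output. This is still text editing of SQL, so it remains a workaround rather than a robust solution.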
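Answer 1 (score: 0)
Inspired by Simon's answer and the comments on the OP, the following function is a workaround that removes all ordering (while keeping any new columns computed as a result of the ordering). It is probably not the most efficient approach, nor the most low-level or direct one, a point I return to at the end of this answer. I leave it to the dbplyr developers to provide a better approach if they see fit.
The function: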
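unarrange <-
  function(remote_df) {
    # Remember the grouping so it can be restored at the end
    existing_groups <- groups(remote_df)
    # Materialise the current result as a temporary table
    remote_df <-
      remote_df %>%
      compute()
    # Rebuild the tbl from the rendered query on that table,
    # discarding the locally stored ordering and grouping
    remote_df <-
      tbl(remote_df$src$con,
          sql_render(remote_df))
    # Re-apply the original groups
    remote_df <-
      group_by(remote_df, !!!existing_groups)
    return(remote_df)
  }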
Working with the input data:
tmp_cars_sdf <-
  copy_to(con_psql, cars, name = "tmp_cars_sdf", overwrite = TRUE)
Consider:
str(tmp_cars_sdf)
# ..$ con <truncated>
# ..$ disco <truncated>
# $ ops:List of 2
# ..$ x : 'ident' chr "tmp_cars_sdf"
# ..$ vars: chr [1:2] "speed" "dist"
# ..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op"
# - attr(*, "class")= chr [1:5] "tbl_PostgreSQLConnection" "tbl_dbi" "tbl_sql" "tbl_lazy" ...
versus:
tmp_cars_sdf <-
  tmp_cars_sdf %>%
  arrange(speed, dist)
str(tmp_cars_sdf)
# $ ops:List of 4
# ..$ name: chr "arrange"
# ..$ x :List of 2
# .. ..$ x : 'ident' chr "tmp_cars_sdf"
# .. ..$ vars: chr [1:2] "speed" "dist"
# .. ..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op"
# ..$ dots:List of 2
# .. ..$ : language ~speed
# .. .. ..- attr(*, ".Environment")=<environment: 0x000000002556b260>
# .. ..$ : language ~dist
# .. .. ..- attr(*, ".Environment")=<environment: 0x000000002556b260>
# ..$ args:List of 1
# .. ..$ .by_group: logi FALSE
# ..- attr(*, "class")= chr [1:3] "op_arrange" "op_single" "op"
# - attr(*, "class")= chr [1:5] "tbl_PostgreSQLConnection" "tbl_dbi" "tbl_sql" "tbl_lazy" ...
Clearly, since a remote table cannot be inherently ordered (or grouped), adding an ordering via arrange() actually modifies the structure of the R object, which must store the ordering and grouping information locally and transmit it only when the final query is built.
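The workaround therefore uses three tricks. First, it uses compute() to generate a temporary table; note that doing so does not reset the groups and ordering held locally. Second, it uses Simon's trick to extract the simple SELECT query corresponding to this new table and overwrites the existing table structure, so that all grouping and ordering information is lost. Third, to preserve the groups, the function re-adds the original groups to the table.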
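Although the example in the OP served to show the problem, the problem actually arose from mutations that depend on some (grouped) ordering of the table. Once the new columns have been built, the old ordering is no longer needed and is in fact sometimes a hindrance, because of the linked GitHub issue. An example is as follows: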
tmp_cars_sdf <-
  copy_to(con_psql, cars, name = "tmp_cars_sdf", overwrite = TRUE)
cars_df <-
  cars %>%
  arrange(speed, dist) %>%
  group_by(speed) %>%
  mutate(diff_dist_up = dist - lag(dist)) %>%
  arrange(speed, desc(dist)) %>%
  mutate(diff_dist_down = dist - lag(dist)) %>%
  ungroup() %>%
  arrange(speed, dist) %>%
  data.frame()
So that:
head(cars_df)
# speed dist diff_dist_up diff_dist_down
# 1 4 2 NA -8
# 2 4 10 8 NA
# 3 7 4 NA -18
# 4 7 22 18 NA
# 5 8 16 NA NA
# 6 9 10 NA NA
With the new function, we can replicate this remotely:
cars_df_2 <-
  tmp_cars_sdf %>%
  arrange(speed, dist) %>%
  group_by(speed) %>%
  mutate(diff_dist_up = dist - lag(dist)) %>%
  # unfortunately the next line is needed
  # because of https://github.com/tidyverse/dbplyr/issues/345
  unarrange() %>%
  arrange(speed, desc(dist)) %>%
  mutate(diff_dist_down = dist - lag(dist)) %>%
  ungroup() %>%
  unarrange() %>%
  collect() %>%
  arrange(speed, dist) %>%
  data.frame()
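Checking, we see that:
identical(cars_df, cars_df_2)
# [1] TRUE
The first problem is the need to call compute(), which uses resources. The second problem is that it must be possible to modify the structure of the R object that encodes the remote table, but I do not know how to build the query from that structure, so I could not do it that way.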
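To make the second point more concrete, here is a minimal sketch of what modifying the structure could look like. It works on a hand-built plain list that mimics the nested ops shown in the str() output above, not on a real dbplyr object. The helper name drop_outer_arrange is hypothetical. Whether assigning such a result back to remote_df$ops produces a valid query is an untested assumption, and these internals may differ between dbplyr versions.

```r
# Mock of the nested "ops" structure printed by str() above. These are plain
# lists built by hand for illustration, not objects created by dbplyr.
base_op <- structure(
  list(x = "tmp_cars_sdf", vars = c("speed", "dist")),
  class = c("op_base_remote", "op_base", "op")
)
arranged_op <- structure(
  list(name = "arrange",
       x = base_op,
       dots = list(quote(speed), quote(dist)),
       args = list(.by_group = FALSE)),
  class = c("op_arrange", "op_single", "op")
)

# Hypothetical helper: peel off arrange steps sitting at the top of the tree.
# Steps nested deeper (for example below a mutate) are deliberately kept,
# since removing them could change how window functions are ordered.
drop_outer_arrange <- function(op) {
  while (inherits(op, "op_arrange")) {
    op <- op$x
  }
  op
}

stripped <- drop_outer_arrange(arranged_op)
class(stripped)
```

If this carried over to real objects, it would avoid the compute() call, but that would need to be verified against the dbplyr version in use.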