Can dplyr modify multiple columns of a Spark DF using a vector?

Asked: 2017-12-10 22:31:02

Tags: r apache-spark dplyr apache-spark-sql sparklyr

I'm new to working with Spark. I want to multiply a large number of columns of a Spark data frame by the values in a vector. So far, using mtcars, I have been using a for loop and mutate(), as follows:

library(dplyr)
library(rlang)
library(sparklyr)

sc1 <- spark_connect(master = "local")

mtcars_sp = sdf_copy_to(sc1, mtcars, overwrite = TRUE)

mtcars_cols = colnames(mtcars_sp)
mtc_factors = 0:10 / 10

# mutate 1 col at a time
for (i in seq_along(mtcars_cols)) {
    # set the equation - use sym() to convert a string to a symbol
    mtcars_eq = quo( UQ(sym(mtcars_cols[i])) * mtc_factors[i])
    # mutate formula - LHS resolves to a string, RHS to a quosure
    mtcars_sp = mtcars_sp %>% 
        mutate(!!mtcars_cols[i] := !!mtcars_eq )
}

dbplyr::sql_render(mtcars_sp)
mtcars_sp

This works for mtcars. However, as sql_render shows, it results in nested SQL queries being sent to Spark, and it breaks down once there are many columns. In this case, is it possible to use dplyr to send a single SQL query instead?

By the way, I'd rather not transpose the data, as that would be too expensive. Any help would be much appreciated!

1 Answer:

Answer 0 (score: 2)

In general, you can use Artem Sokolov's great answer:
library(glue)

mtcars_sp %>% 
  mutate(!!! setNames(glue("{mtcars_cols} * {mtc_factors}"), mtcars_cols) %>% 
    lapply(parse_quosure))
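
The string-parsing step relies on rlang::parse_quosure(), which later rlang releases renamed to parse_quo(). If you'd rather not build expressions from strings at all, here is a minimal sketch of the same single-mutate() idea assembled directly from quosures, reusing mtcars_cols, mtc_factors, and mtcars_sp from the question (scale_exprs and mtcars_scaled_sp are just illustrative names). Since all the expressions are spliced into one mutate(), dbplyr::sql_render() should show a single query:

library(rlang)

# build one quosure per column: quo(mpg * 0), quo(cyl * 0.1), ...
scale_exprs <- Map(
    function(col, f) quo(!!sym(col) * !!f),
    mtcars_cols, mtc_factors
)

# splice all the expressions into a single mutate() call
mtcars_scaled_sp <- mtcars_sp %>% 
    mutate(!!!scale_exprs)

dbplyr::sql_render(mtcars_scaled_sp)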

However, if this is intended as input to an MLlib algorithm, then ft_vector_assembler combined with ft_elementwise_product may be a better fit:

scaled <- mtcars_sp %>% 
  ft_vector_assembler(mtcars_cols, "features") %>% 
  ft_elementwise_product("features", "features_scaled", mtc_factors)

The result can be separated (I don't recommend this if you're going to use MLlib) into individual columns with sdf_separate_column:

scaled %>% 
  select(features_scaled) %>% 
  sdf_separate_column("features_scaled", mtcars_cols)