I would like to use SparkR / sparklyr to do some data wrangling on Databricks first (plus other functionality depending on various R libraries). The main goal is to apply a function to every column of the initial data frame and write the result to storage. So far, the only workable solution I have found is spark.lapply from SparkR.
Inside the function writer, other functions from different R packages, as well as custom functions, can be called.
library(SparkR)
library(sparklyr)
library(dplyr)
df1 <- createDataFrame(data.frame(rnorm(1000, 10, 5), rnorm(1000, 10, 5), rnorm(1000, 10, 5)))
inp <- c("COL_1", "COL_2", "COL_3")
names(df1) <- inp

# Select one column, derive a few new columns from it, and write the result to Parquet.
writer <- function(x) {
  df2 <- df1 %>%
    select(x) %>%
    rename(INPUT = x) %>%
    # Adding additional columns is just a placeholder for other transformations taking place
    mutate(add1 = INPUT,
           add2 = INPUT,
           add3 = INPUT)

  spark_write_parquet(df2, path = paste0("home/", x, "/"), mode = "overwrite")
}

spark.lapply(inp, writer)
Ultimately, I want these three data frames written as Parquet files to the defined directories. Unfortunately, I receive the errors below, which seem to indicate that the packages available in the global environment are not available on the worker nodes. I have tried several workarounds (as described here), such as installing and loading the packages on the nodes separately, or installing and loading them inside the function itself; a sketch of the latter follows.
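For illustration, the "load the packages inside the function" attempt looked roughly like this (a sketch, not the exact code I ran; the body of writer is unchanged apart from the library calls):

# Attempted workaround (sketch): attach the packages inside the function
# so they get loaded on the worker, not only on the driver.
writer <- function(x) {
  library(dplyr)
  library(sparklyr)
  # ... same select/rename/mutate pipeline and spark_write_parquet() call as above
}

spark.lapply(inp, writer)

Regardless of the variant, the job aborts with the following error: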
Error in handleErrors(returnStatus, conn) :
org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 14.0 failed 4 times, most recent failure: Lost task 2.3 in stage 14.0 (TID 44, 10.139.64.9, executor 6): org.apache.spark.SparkException: R computation failed with
Error in df1 %>% select(x) %>% rename(INPUT = x) %>% mutate(add1 = INPUT, :
could not find function "%>%"
Calls: compute -> computeFunc -> lapply -> lapply -> FUN
Execution halted
at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:51)
[...]
Or, alternatively, this message appears:
Error in handleErrors(returnStatus, conn) :
org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 12.0 failed 4 times, most recent failure: Lost task 6.3 in stage 12.0 (TID 134, 10.139.64.12, executor 6): org.apache.spark.SparkException: R computation failed with
Warning: namespace ‘sparklyr’ is not available and has been replaced
by .GlobalEnv when processing object ‘’
Error in df1 %>% select(x) %>% rename(INPUT = x) %>% mutate(add1 = INPUT, :
could not find function "%>%"
Calls: compute -> computeFunc -> lapply -> lapply -> FUN
[...]
Unfortunately, none of my ideas have worked so far... I would appreciate any help, thanks!