I created this minimal working example below. First we create a data frame, and we need a simple function to run on the Spark workers.
# test spatial analysis on spark via sparklyr
require(sparklyr)
require(dplyr)
# spark configuration
conf <- spark_config()
conf$`sparklyr.shell.driver-memory` <- "2G"
conf$spark.memory.fraction <- 0.8
# start connection to spark
sc <- spark_connect(master = "local", config = conf, version = "2.2.0")
# create a simple spatial object
# points taken from Google Maps >> WGS 84
# GPS latitude approximately 52, longitude approximately 5
p1 <- c(51.9077783,4.4815503,0.0000000)
p2 <- c(51.906471,4.4833524,15.0000000)
p3 <- c(51.9077888,4.4747269,5.0000000)
df <- as.data.frame(rbind(p1,p2,p3))
names(df) <- c("lat","long","z")
df$id <- rownames(df)
# convert to spatial data
require(sp)
# function to run on spark
myTransform <- function(df) {
  require(sp)  ## <<< UPDATE: load 'sp' on the worker
  WGScoor <- df
  coordinates(WGScoor) <- ~long+lat
  # add projection
  proj4string(WGScoor) <- CRS("+proj=longlat +datum=WGS84")
  return(WGScoor)
}
sp.df <- myTransform(df)
# load data to spark
mysdf <- sdf_copy_to(sc, df, overwrite = TRUE)
#df3 <- sdf_copy_to(sc,sp.df, overwrite=TRUE)
# now run on Spark
# sdf_len(sc, 10) %>% spark_apply(function(df) df * 10)
df_loc <- mysdf %>%
  spark_apply(myTransform, packages = c("sp")) %>%
  collect()
# disconnect
spark_disconnect(sc)
I know the returned object is a spatial data frame and not a Spark RDD, so I did not expect it to run completely.
My first question:
UPDATE: adding require(sp) inside the function solved this issue.
How do I pass the package 'sp' to the Spark workers correctly?
The log shows:
18/08/27 14:36:50 ERROR sparklyr: RScript (9000) terminated unexpectedly: could not find function "coordinates<-"
18/08/27 14:36:50 ERROR sparklyr: Worker (9000) failed to complete R process
18/08/27 14:36:50 ERROR sparklyr: Worker (9000) failed to run rscript: java.lang.Exception: sparklyr worker rscript failure with status 255, check worker logs for details.
So apparently it cannot find the function 'coordinates', which suggests to me that the 'sp' package is not available on the workers...
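For reference, a minimal sketch of the two mechanisms that seem to make a package available inside spark_apply: loading it with require() inside the closure, and declaring it via the packages argument. This assumes 'sp' is installed in the R library on every worker node:

result <- mysdf %>%
  spark_apply(
    function(df) {
      require(sp)                    # load 'sp' inside the closure, on the worker
      coordinates(df) <- ~long+lat   # promote to a SpatialPointsDataFrame
      as.data.frame(df)              # spark_apply can only return a data frame
    },
    packages = c("sp")               # and/or declare the dependency here
  ) %>%
  collect()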
Second question: how do I convert a spatial data frame to an RDD in Spark while preserving the spatial details? Right now it returns a regular data frame; where should the conversion to a spatial data frame take place?
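Since spark_apply serializes the function's return value back into a plain Spark DataFrame, the only workaround I can think of (a sketch, not tested at scale) is to keep the coordinates as ordinary columns while the data lives on Spark, and rebuild the sp object on the driver after collect():

# sketch: do the per-partition work on Spark with plain columns,
# then rebuild the spatial object locally
df_loc <- mysdf %>%
  spark_apply(function(df) df) %>%   # placeholder for the real transformation
  collect()
sp.local <- as.data.frame(df_loc)
coordinates(sp.local) <- ~long+lat
proj4string(sp.local) <- CRS("+proj=longlat +datum=WGS84")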
In the final application I need to create shapefiles and save them directly on HDFS...
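For that last step, a rough sketch of what I have in mind: write the shapefile to local disk with rgdal::writeOGR and push it to HDFS with the hdfs command-line tool (the paths below are just placeholders):

# sketch: write the shapefile locally first, then copy it to HDFS
require(rgdal)
writeOGR(sp.local, dsn = "points_shp", layer = "points",
         driver = "ESRI Shapefile")
system("hdfs dfs -put points_shp /user/me/points_shp")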