Is there a way to use dapply to do pattern matching and replacement across multiple columns of a SparkR DataFrame?

Asked: 2016-10-07 15:20:32

Tags: sparkr

Running Spark 2.0 locally:

df <- data.frame(a = c("$0.00 ", "$601.19 ", "$601.19 ", "$238.58 "),
             b = c("$148.81 ", "$396.85", "$24.37 ", "$24.37 "),
             c = c("$238.58 ", "$211.15 ", "$422.30 ", "$150.30")
             )

ddf <- as.DataFrame(df)

I would like to run something like this:

ddf2 <- dapply(ddf, function(x) { regexp_replace(x, "\\$|,", "")}, schema(ddf))

But it returns an error:

head(ddf2)
ERROR Executor: Exception in task 0.0 in stage 13.0 (TID 13)
org.apache.spark.SparkException: R computation failed with
Error in (function (classes, fdef, mtable)  : 
unable to find an inherited method for function ‘regexp_replace’ for signature ‘"data.frame", "character", "character"’

1 answer:

Answer 0 (score: 1)

Using dapply:

ddf2 <- dapply(ddf, function(x) {
  as.data.frame(apply(x, MARGIN = 2, function(y) gsub("\\$|,", "", y, perl = TRUE)),
                stringsAsFactors = FALSE)
}, schema(ddf))

dapply expects the anonymous function to return an R data.frame.
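The function handed to dapply is ordinary R, so it can be tried on a local data.frame before involving Spark. The sketch below is a variant that uses lapply instead of apply. The name strip_currency and the sample values are assumptions made for illustration, not part of the original answer. The reason for the swap: apply on a one-row data.frame returns a plain vector, which as.data.frame then turns into a single column, whereas lapply keeps one column per input column whatever the row count.

```r
# Hypothetical helper: removes "$" and "," from every column of a local
# R data.frame and returns a data.frame with the same column names.
strip_currency <- function(x) {
  cleaned <- lapply(x, function(y) gsub("\\$|,", "", y))
  as.data.frame(cleaned, stringsAsFactors = FALSE)
}

# Illustrative sample data (not from the question).
df <- data.frame(a = c("$0.00 ", "$601.19 "),
                 b = c("$148.81 ", "$1,396.85"),
                 stringsAsFactors = FALSE)

# Shape is preserved even when a chunk holds a single row.
str(strip_currency(df[1, ]))
str(strip_currency(df))
```

Under these assumptions it should be possible to pass the helper directly, as in dapply(ddf, strip_currency, schema(ddf)). Note that the pattern leaves the trailing spaces in the sample values untouched; trimws could be added if they need to go as well.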

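The regexp_replace method, by contrast, requires a SparkDataFrame Column as input. Inside dapply, x is a local R data.frame, which is why the call in the question fails with "unable to find an inherited method".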

Example without dapply (replaces values in column a only):

withColumn(ddf,'a', regexp_replace(ddf$a, "\\$|,", ""))
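Since the question is about several columns, the same call can be repeated for each one. This is a minimal sketch, assuming a working local Spark 2.x installation with SparkR available; the variable name cleaned and the shortened sample data are introduced here for illustration only.

```r
library(SparkR)
sparkR.session()  # assumes Spark 2.x is installed and reachable locally

# Shortened sample data (illustrative).
df <- data.frame(a = c("$0.00 ", "$601.19 "),
                 b = c("$148.81 ", "$396.85"),
                 stringsAsFactors = FALSE)
ddf <- as.DataFrame(df)

# Replace each column in turn; withColumn overwrites a column when the
# name already exists, and [[ returns the Column that regexp_replace needs.
cleaned <- ddf
for (col_name in columns(ddf)) {
  cleaned <- withColumn(cleaned, col_name,
                        regexp_replace(cleaned[[col_name]], "\\$|,", ""))
}

head(cleaned)
```

This keeps the work in Spark column expressions rather than shipping each partition to an R worker, which is the usual reason to prefer it over dapply when a built-in column function already does the job.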