Question

I'm having trouble converting an lapply() call in R to spark.lapply(). My R code looks like this:

> lst <- lapply(1:(length(SampleData$A)-n), function(i) SampleData$A[i:(i+n)])
> names(lst) <- paste0("SampleData$A", seq_along(lst))
> list2env(lst, envir = .GlobalEnv)

I tried to do the same thing in SparkR using spark.lapply():

count <- function(i) {
    df2$A[i:(i+n)]
}
lst <- spark.lapply(1:(length(df2$A)-n), count)

However, I get the following error:

Error in writeType(con, serdeType) :
  Unsupported type for serialization nonstandardGenericFunction

I'm relatively new to SparkR, so any help would be appreciated. Thanks!

Answer 1

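In my (limited) experience with spark.lapply, what you basically need to do is make sure your namespace is explicit, especially if you are using external packages.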

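In other words, try to be explicit about every additional variable that spark.lapply needs to know about and that has to get inside the function. Although the help file says it will usually pick things up from the global environment, this approach keeps you sane when that doesn't work...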

In pseudocode, your lapply should look something like this:

spark.lapply([(x1, y1), (x2, y2), (x3, y3)], function(x) do_stuff(x[1], x[2]))

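do_stuff should not depend on anything outside its own environment. In my experience, any kind of option, such as the na.action option (e.g. na.pass), also needs to be set inside the function. The manual also tells you to load again, inside the function, any libraries you may have loaded!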
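To make the pseudocode above concrete, here is a minimal sketch of a self-contained worker. The field names (start, width, values) and the toy data are made up for illustration and are not from the question; base lapply() stands in for the distributed call so the snippet runs without a Spark session.

```r
# Hypothetical worker: every input arrives through `task`, and the package it
# needs is loaded inside the function body rather than assumed from outside.
do_stuff <- function(task) {
  library(stats)  # already attached in a local session; shown for the pattern
  window <- task$values[task$start:(task$start + task$width)]
  median(window)
}

# Toy data (an assumption for this sketch).
values <- c(5.1, 4.9, 4.7, 4.6, 5.0, 5.4)
width <- 2

# One self-describing task per window start position.
tasks <- lapply(seq_len(length(values) - width), function(s) {
  list(start = s, width = width, values = values)
})

# Local stand-in for the distributed call.
local_result <- lapply(tasks, do_stuff)

# With an active SparkR session, the distributed call would be:
# spark_result <- SparkR::spark.lapply(tasks, do_stuff)
```

Because each task carries its own copy of the data, the worker does not need to look anything up in the calling environment. The price is that the data is serialized once per task, which only makes sense for small inputs.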

As for your code, I would change it to look like this:

count <- function(i, df2) {
  df2$Sepal.Length[i:(i+n)]
}

df2 <- iris
n = 3

# creating a new list of parameters as in the code example above
# this will be:
# [(integer, dataframe)]
input_list <- lapply(1:(length(df2$Sepal.Length)-n), function(x) return(list(i=x, df2=df2)))

# doing what you did above
lst <- lapply(input_list, function(x) count(x$i, x$df2))
splst <- spark.lapply(input_list, function(x) count(x$i, x$df2))
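Since the result is an ordinary R list, the naming step from the question can be applied to it afterwards. The sketch below uses base lapply() as a stand-in for the list produced above. The names window_1, window_2, ... and the separate environment are my own choices, not from the question: syntactic names are easier to use than ones containing `$`, and a dedicated environment avoids cluttering the global one.

```r
# Stand-in for the list returned by lapply()/spark.lapply() above.
values <- iris$Sepal.Length
n <- 3
lst <- lapply(seq_len(length(values) - n), function(i) values[i:(i + n)])

# Give each window a syntactic name, then expose the windows as variables
# in a dedicated environment instead of .GlobalEnv.
names(lst) <- paste0("window_", seq_along(lst))
windows <- list2env(lst, envir = new.env())

first_window <- get("window_1", envir = windows)
```

Passing envir = .GlobalEnv instead would reproduce the behaviour of the original code.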

If you would rather rely on variables from the enclosing environment, I'd suggest setting up the lapply like this:

# x is a plain integer index here, so pass it directly (x$i would fail)
lst <- lapply(1:(length(df2$Sepal.Length)-n), function(x) count(x, df2))
splst <- spark.lapply(1:(length(df2$Sepal.Length)-n), function(x) count(x, df2))

This usually works, but strange things sometimes happen if the variables are objects of non-standard R types (for example, xgb.DMatrix objects).
