Running aggregate within dmapply (ddR package)

Date: 2017-05-09 12:59:33

Tags: r dataframe parallel-processing aggregate distributed-computing

I would like to run the aggregate function within the dmapply function, as provided by the ddR package.

Desired results

The desired result mirrors the simple output produced by aggregate in base R:

aggregate(
  x = mtcars$mpg,
  FUN = function(x) {
    mean(x, na.rm = TRUE)
  },
  by = list(trans = mtcars$am)
)

This produces:

  trans        x
1     0 17.14737
2     1 24.39231

Attempt - ddR

I would like to arrive at the same result using dmapply, as shown below:

# ddR
require(ddR)

# ddR object creation
distMtcars <- as.dframe(mtcars)

# Aggregate / dmapply
dmapply(
  FUN = function(x, y) {
    aggregate(FUN = mean(x, na.rm = TRUE),
              x = x,
              by = list(trans = y))
  },
  distMtcars$mpg,
  y = distMtcars$am,
  output.type = "dframe",
  combine = "rbind"
)

The code fails:

  

Error in match.fun(FUN) : 'mean(x, na.rm = TRUE)' is not a function, character or symbol
Called from: match.fun(FUN)

Update

Fixing the error that was pointed out removes the error message, but the code still does not produce the desired result. The code:

# Avoid namespace conflict with other packages
ddR::collect(
  dmapply(
    FUN = function(x, y) {
      aggregate(
        FUN = function(x) {
          mean(x, na.rm = TRUE)
        },
        x = x,
        by = list(trans = y)
      )
    },
    distMtcars$mpg,
    y = distMtcars$am,
    output.type = "dframe",
    combine = "rbind"
  )
)

yields:

[1] trans x    
<0 rows> (or 0-length row.names)

1 Answer:

Answer 0 (score: 2)

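It works fine for me if you change your aggregate call to be consistent with the one you used earlier: FUN = function(x) mean(x, na.rm = T). The reason it cannot find mean(x, na.rm = T) is that this is not a function (it is a function call), whereas mean itself, or an anonymous wrapper such as function(x) mean(x, na.rm = T), is a function.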
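To make the function versus function-call distinction concrete, here is a minimal base R sketch that needs no ddR. The names bad, good and short are illustrative only and do not appear in the question.

```r
# Passing a call: mean(...) evaluates to a number, not a function,
# so aggregate() cannot use it as FUN and raises an error.
bad <- tryCatch(
  aggregate(
    x = mtcars$mpg,
    by = list(trans = mtcars$am),
    FUN = mean(mtcars$mpg, na.rm = TRUE)
  ),
  error = function(e) e
)
inherits(bad, "error")

# Passing a function: an anonymous wrapper around mean().
good <- aggregate(
  x = mtcars$mpg,
  by = list(trans = mtcars$am),
  FUN = function(v) mean(v, na.rm = TRUE)
)

# Shorthand: pass mean itself; extra arguments are forwarded to FUN.
short <- aggregate(
  x = mtcars$mpg,
  by = list(trans = mtcars$am),
  FUN = mean,
  na.rm = TRUE
)
```

Both good and short should contain one row per value of am, matching the desired output shown in the question.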

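Also, it will give you NA results unless you change x = distMtcars$mpg to x = collect(distMtcars)$mpg. The same goes for y. With all that said, I think this should work for you: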

res <- dmapply(
  FUN = function(x, y) {
    aggregate(
      FUN = function(x) mean(x, na.rm = TRUE),
      x = x,
      by = list(trans = y)
    )
  },
  x = list(collect(distMtcars)$mpg),
  y = list(collect(distMtcars)$am),
  output.type = "dframe",
  combine = "rbind"
)

Then you can run collect(res) to see the result.
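Why the list() wrapping matters can be sketched in base R, on the assumption that dmapply iterates over its arguments the way base mapply does. Map() is used here purely as a local stand-in, not as the ddR implementation, and agg, per_element, whole and result are hypothetical names.

```r
# Helper that mirrors the aggregate() call from the answer.
agg <- function(x, y) {
  aggregate(
    x = x,
    by = list(trans = y),
    FUN = function(v) mean(v, na.rm = TRUE)
  )
}

# Bare vectors: the function is called once per element (once per car),
# so every call aggregates a single value.
per_element <- Map(agg, mtcars$mpg, mtcars$am)
length(per_element)

# Vectors wrapped in list(): the function is called once and
# receives the complete mpg and am vectors.
whole <- Map(agg, list(mtcars$mpg), list(mtcars$am))
result <- do.call(rbind, whole)
result
```

Under that assumption, wrapping each vector in list() is what lets a single call see all rows, which is the shape aggregate needs to compute group means.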