Question

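This question is a "why", not a "how". In the code below, I am trying to understand why dplyr::mutate evaluates one custom function (f()) with the entire vector but not the other custom function (g()). What exactly is mutate doing?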

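library(dplyr)

# Reference computation on the whole vector at once
set.seed(1); sum(rnorm(100, c(0, 10, 100)))

# Custom function: sum of 100 normal draws with mean m
f <- function(m) {
    set.seed(1)
    sum(rnorm(100, mean = m))
}
# Custom function: thin wrapper around sin()
g <- function(m) sin(m)
df <- data.frame(a = c(0, 10, 100))
# Plain mutate on the ungrouped data frame
y1 <- mutate(df, asq = a^2, fout = f(a), gout = g(a))
# Row-wise mutate
y2 <- rowwise(df) %>%
    mutate(asq = a^2, fout = f(a), gout = g(a))
# One group per distinct value of a
y3 <- group_by(df, a) %>%
    summarize(asq = a^2, fout = f(a), gout = g(a))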

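For all three columns asq, fout, and gout, evaluation happens row by row in y2 and y3, and the results are the same. However, y1$fout is 3640.889 for all three rows, which is the result of evaluating sum(rnorm(100, c(0, 10, 100))). So in y1 the function f() appears to be evaluated on the entire vector for every row.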
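As a check on that figure, here is a minimal base-R sketch of where 3640.889 comes from. It relies on rnorm() recycling a mean vector across its draws; the variable names (whole, noise, means) are illustrative and not part of the original code.

```r
# rnorm() recycles the three means over the 100 draws: 0, 10, 100, 0, 10, ...
set.seed(1)
whole <- sum(rnorm(100, mean = c(0, 10, 100)))

# The same 100 standard-normal draws, without any mean shift
set.seed(1)
noise <- sum(rnorm(100))

# The means as they are recycled: 34 zeros, 33 tens, 33 hundreds
means <- rep_len(c(0, 10, 100), 100)

# The whole-vector result is the noise plus the recycled means (3630)
all.equal(whole, noise + sum(means))
```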

A closely related question was asked elsewhere, "mutate/transform in R dplyr (Pass custom function)", but the "why" was not explained there.

Answer 1

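sin and ^ are vectorized, so they natively operate on each individual value and return one result per element. f is not vectorized: sum() reduces whatever it receives to a single number, so when mutate passes the whole column a to f, that one result is recycled across all rows. You can, however, do f = Vectorize(f), and it will then operate on each individual value as well.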
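A quick way to see the difference without dplyr is to call both functions on the whole vector and compare the lengths of the results. This is only a sketch that reuses f and g from the question.

```r
f <- function(m) {
    set.seed(1)
    sum(rnorm(100, mean = m))   # sum() collapses everything to one number
}
g <- function(m) sin(m)
a <- c(0, 10, 100)

length(g(a))   # one result per element of a
length(f(a))   # a single number for the whole vector
```

A length-one result is what gets repeated in every row of fout in the ungrouped mutate call.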

y1 <- mutate(df, asq=a^2, fout=f(a), gout=g(a))
y1

    a   asq     fout       gout
1   0     0 3640.889  0.0000000
2  10   100 3640.889 -0.5440211
3 100 10000 3640.889 -0.5063656

f = Vectorize(f)

y1a <- mutate(df, asq=a^2, fout=f(a), gout=g(a))
y1a

    a   asq        fout       gout
1   0     0    10.88874  0.0000000
2  10   100  1010.88874 -0.5440211
3 100 10000 10010.88874 -0.5063656
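For intuition about what Vectorize does here, the sketch below compares it with an explicit element-by-element loop via vapply. The names f_vec and by_hand are illustrative; the comparison assumes f as defined in the question.

```r
f <- function(m) {
    set.seed(1)
    sum(rnorm(100, mean = m))
}
a <- c(0, 10, 100)

# Vectorize() wraps f so that it is called once per element of its argument
f_vec <- Vectorize(f)

# The same thing written by hand: one call to f per element, each returning one number
by_hand <- vapply(a, f, numeric(1))

all.equal(f_vec(a), by_hand)
```

Because f resets the seed on every call, each element sees the same random draws and the three results differ only by the shift in the mean.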

Some additional information on vectorization here, here, and here.

Answer 2

We can use map to loop over each element of 'a' and apply the function f.

library(tidyverse)
df %>%
    mutate(asq = a^2, fout = map_dbl(a, f), gout = g(a)) 
#    a   asq        fout       gout
#1   0     0    10.88874  0.0000000
#2  10   100  1010.88874 -0.5440211
#3 100 10000 10010.88874 -0.5063656
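If you control the function yourself, another option is to move the per-element loop inside it, so that callers can pass a whole column. The following is a base-R sketch under that assumption; f_each is a hypothetical name, not something from the original posts.

```r
# Hypothetical element-wise variant of f(); the name f_each is an assumption
f_each <- function(m) {
    vapply(m, function(mi) {
        set.seed(1)                  # same seed for every element, as in f()
        sum(rnorm(100, mean = mi))
    }, numeric(1))
}

df <- data.frame(a = c(0, 10, 100))
# Plain assignment here; mutate(df, fout = f_each(a)) would hand f_each the same column vector
df$fout <- f_each(df$a)
df
```

The trade-off is the usual one: the wrapper is convenient, but it still calls rnorm once per element, so it is not faster than map_dbl or Vectorize.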

Why is R dplyr::mutate inconsistent with custom functions
