Passing a character vector as an argument to a function in plyr

Asked: 2013-02-27 04:09:05

Tags: r function vector plyr argument-passing

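I suspect I'm going about this the wrong way, but I'd like to pass a character vector as an argument to a function inside ddply. There are plenty of Q&As about removing quotes and the like, but none of them seem to work for me (e.g. "Remove quotes from a character vector in R" and http://r.789695.n4.nabble.com/Pass-character-vector-to-function-argument-td3045226.html).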

library(plyr)

# reproducible data
df1<-data.frame(a=sample(1:50,10),b=sample(1:50,10),c=sample(1:50,10),d=(c("a","b","c","a","a","b","b","a","c","d")))
df2<-data.frame(a=sample(1:50,9),b=sample(1:50,9),c=sample(1:50,9),d=(c("e","f","g","e","e","f","f","e","g")))
df3<-data.frame(a=sample(1:50,8),b=sample(1:50,8),c=sample(1:50,8),d=(c("h","i","j","h","h","i","i","h")))

#make a list
list.1<-list(df1=df1,df2=df2,df3=df3)

# desired output
lapply(list.1, function(x)   ddply(x, .(d), function(x)  data.frame(am=mean(x$a), bm=mean(x$b), cm=mean(x$c))))

$df1
  d       am       bm       cm
1 a 31.00000 29.25000 18.50000
2 b 31.66667 24.33333 34.66667
3 c 18.50000  5.50000 24.50000
4 d 36.00000 39.00000 43.00000

$df2
  d       am       bm cm
1 e 18.25000 32.50000 18
2 f 27.66667 41.33333 24
3 g 25.00000  7.50000 42

$df3
  d       am       bm       cm
1 h 36.00000 25.00000 20.50000
2 i 25.33333 37.33333 24.33333
3 j 32.00000 32.00000 46.00000

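But my real use case involves many new columns and different kinds of calculations that I want to compute inside the ddply function. So I'd like to do something like this: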

# here's a simple version of a function that I want to send to ddply    
func <- "am=mean(x$a), bm=mean(x$b), cm=mean(x$c)"

# here's how I imagine it might work
lapply(list.1, function(x)   ddply(x, .(d), function(x)  data.frame(func)) )

# not the desired outcome... 
$df1
  d                                     func
1 a am=mean(x$a), bm=mean(x$b), cm=mean(x$c)
2 b am=mean(x$a), bm=mean(x$b), cm=mean(x$c)
3 c am=mean(x$a), bm=mean(x$b), cm=mean(x$c)
4 d am=mean(x$a), bm=mean(x$b), cm=mean(x$c)

$df2
  d                                     func
1 e am=mean(x$a), bm=mean(x$b), cm=mean(x$c)
2 f am=mean(x$a), bm=mean(x$b), cm=mean(x$c)
3 g am=mean(x$a), bm=mean(x$b), cm=mean(x$c)

$df3
  d                                     func
1 h am=mean(x$a), bm=mean(x$b), cm=mean(x$c)
2 i am=mean(x$a), bm=mean(x$b), cm=mean(x$c)
3 j am=mean(x$a), bm=mean(x$b), cm=mean(x$c)

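I've tried noquote, deparse, eval(as.symbol()), do.call(data.frame, ...), and some of the approaches described at https://github.com/hadley/devtools/wiki/Evaluation on func, all to no avail. The solution may be obvious at this point (i.e. melt everything!), but in case it isn't, here is a longer example that is closer to my use case: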

# sample data
s <- 23 # number of samples
r <- 10 # number of runs per sample
el <- 17 # number of elements
mydata <- data.frame(ID = unlist(lapply(LETTERS[1:s], function(x) rep(x, r))),
                     run = rep(1:r, s))
# insert fake element data
mydata[letters[1:el]] <- lapply(1:el, function(i) rnorm(s*r, runif(1)*i^2))

# generate all combinations of 5 runs from  ten runs
su <- 5 # number of runs to sample from ten runs
idx <- combn(unique(mydata$run), su)

# RSE function
RSE <- function(x) {100*( (sd(x)/sqrt(length(x)))/mean(x) )}

# make a list of dfs for all samples for each combination of five runs
# to prepare to calculate RSEs
combys1 <- lapply(1:ncol(idx), function(i) mydata[mydata$run %in% idx[,i],] )

# make a list of dfs with RSE for each ID, for each combination of runs
combys2 <- lapply(1:length(combys1), function(i) ddply(combys1[[i]], "ID", summarise, RSEa=RSE(a), RSEb=RSE(b), RSEc=RSE(c), meana=mean(a), meanb=mean(b), meanc=mean(c)))

I'd like to replace RSEa=RSE(a), RSEb=RSE(b), RSEc=RSE(c), meana=mean(a), meanb=mean(b), meanc=mean(c) in the last line above with the object doRSE, to avoid a lot of typing:

# prepare to calculate new columns with RSE and means
RSEs <- paste0("RSE", names(mydata)[3:ncol(mydata)])
RSExs <- paste0("RSE(", names(mydata)[3:ncol(mydata)], ")")
# paste0 is vectorised and has no sep argument, so collapse is all that is needed
doRSE <- paste0(RSEs, "=", RSExs, collapse=",")

I'm open to solutions involving base R, data.table, and dirty tricks. These seem close to what I want, but I can't quite translate them to my problem: "Pass character argument and evaluate", "Force evaluation of multiple variables using vector of character", and "Using a vector of characters that correspond to an expression as an argument to a function".

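UPDATE: Here's the catch: I want to be able to modify func in the simple example (or doRSE in my use case) to create a bunch of new columns produced by various calculations on existing columns, for exploring the data. I want a workflow that lets the resulting data frame have new columns that are not in the original data frame. Sorry this wasn't clearer in the original question. I couldn't see how to adapt @Marius's answer to do that, but @mnel's was helpful (see the update below).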

Using @mnel's excellent dirty trick, with a few small fixes, I can get the desired result in my use case:

# @mnel's solution, adapted (no period before eval)
combys2 <- lapply(combys1, function(x) do.call(ddply,c(.data = quote(x), 
                           .variables = quote(.(ID)), .fun = quote(summarize),
                           eval(parse(text = sprintf('.(%s)', doRSE ))))))
head(combys2)

[[1]]
   ID       RSEa      RSEb     RSEc      RSEd     RSEe      RSEf     RSEg      RSEh      RSEi
1   A  168.30658  21.68632 5.657228  5.048057 4.162017 2.9581874 1.849009 0.6925148 0.4393491
2   B   26.55071  26.20427 4.782578  4.385409 2.342764 2.1813874 2.719625 1.1576681 0.6427935
3   C   73.83165  14.47216 8.154435  6.273202 3.046978 1.2179457 2.811405 1.1401837 0.8167067
4   D   31.96170  57.89260 9.438220  7.388410 3.755772 0.8601780 3.724875 0.8358204 0.9939387
5   E   63.22537  60.35532 5.839690 11.691304 3.828430 0.9217787 4.204300 0.8217187 0.7876634
6   F   56.37635  65.37907 4.149568  5.496308 2.227544 2.1548455 2.847291 1.1956212 0.2506518
7   G   69.32232  23.63214 4.255847  7.979225 4.917660 1.6185960 3.156521 0.3265555 0.8133279
8   H   29.82015  40.74184 7.372100  7.464792 2.749862 0.6054420 4.061368 0.9973909 1.3807720
9   I   50.58114  19.53732 2.989920  9.767678 4.000249 1.7451322 1.175397 0.9952093 0.9095086
10  J   92.96462  39.77475 6.140688 10.295668 3.407726 2.4663758 3.030444 0.5743419 0.9296482
11  K   90.72381  42.25092 2.483069  6.781054 3.142082 1.8080633 2.891740 1.1996176 0.8525290
12  L -385.24547  40.81267 4.506087  8.148382 2.976488 0.8304432 2.234134 0.2108664 0.4979777
13  M   22.77743  33.98332 2.913926  8.764639 2.307293 0.8366635 3.229944 1.0003125 0.3878567
14  N   66.75163  34.16087 6.611326 13.865377 1.285522 1.3863958 4.165575 0.7379386 0.4515194
15  O   37.37188 100.57479 5.738877  5.724862 2.839638 1.1366610 3.186332 0.7383855 0.3954544
16  P   17.08913  26.62210 6.060130  4.110893 2.688908 2.6970727 1.609043 1.3860834 0.8780010
17  Q   13.96392  74.92279 5.469304  8.467638 2.974131 1.2135436 3.284564 0.6232778 1.0759226
18  R   42.59899  30.75952 4.842832  8.764158 1.874020 1.5791048 3.427342 1.4479638 0.2964455
19  S   26.03307  15.56352 6.968717  7.783876 4.439733 2.0764179 4.683080 0.7459654 1.1268772
20  T   71.57945  33.81362 7.147049 11.201551 2.128315 2.2051611 2.419805 0.2688807 1.1559635
21  U   73.93002  11.77155 7.738910  7.207041 1.478491 1.4409844 4.042419 0.5883490 0.5585716
22  V   67.93166  39.54994 5.701551  8.636122 2.472963 1.6514199 2.627965 1.0359048 0.8747136
23  W   11.23057  12.51272 7.003448  7.424559 4.102693 0.6614847 2.246305 1.3422405 0.2665246
        RSEj      RSEk      RSEl      RSEm      RSEn      RSEo      RSEp      RSEq
1  0.6366733 0.3713819 2.1993487 0.3865293 0.5436581 0.9187585 0.4344699 0.8915868
2  0.3445095 0.2932025 1.8563179 0.5397595 1.0433388 0.3533622 0.1942316 0.1941072
3  0.2720344 0.5507595 2.0305726 0.4377259 0.8589854 0.5690906 0.1397337 0.4043247
4  0.6606667 0.6769112 3.4737352 0.5674656 1.2519256 0.8718298 0.1162969 0.8287504
5  0.4620774 0.5598069 1.9236112 0.7990046 0.9832732 0.6847352 0.4070675 0.9005185
6  0.7981610 0.4005493 0.9721068 0.2770989 1.7054674 0.3110139 0.4521183 0.8740444
7  0.3969116 0.4717575 4.1341106 0.7510628 0.9998299 0.5342292 0.4319642 1.1861705
8  0.2963956 0.2652221 0.4775827 0.2617120 0.8261874 0.5266087 0.1900943 0.2350553
9  0.2609359 0.5431035 2.6478440 0.1606919 0.7407281 0.6802262 0.1802069 0.7438792
10 0.4239787 0.8753544 3.4218030 0.5467869 0.7404017 0.5581173 0.3682014 0.6361436
11 0.4188502 0.8629862 4.4181479 0.1623873 0.8018811 0.5873609 0.3592134 0.5357984
12 0.5790265 0.5009210 3.7534287 0.1933726 0.5809601 0.5777868 0.3400925 0.4783890
13 0.3562582 0.2552756 2.1393219 0.1849345 0.5796194 0.6129469 0.3363311 0.4382125
14 0.7921502 0.6147990 2.9054634 0.5852325 1.4954072 0.9983203 0.2937837 0.7654504
15 0.5840424 0.2757707 1.5695675 0.3305385 0.8712636 0.5816490 0.1985457 0.7213289
16 0.3301280 0.3008273 2.9014987 0.4540833 0.5966479 0.9042004 0.1631630 0.7262141
17 0.5882511 0.2820978 3.0652666 0.4518936 1.3168151 0.4749311 0.2244693 0.6583083
18 0.4048816 0.3708787 3.2207478 0.2603412 1.3168318 0.3318745 0.3120436 0.6210711
19 0.4425123 0.3602076 3.7609863 0.5399527 0.8302572 0.3246904 0.1952143 0.2915325
20 0.5877835 0.6339015 1.6908570 0.3223056 0.5239339 0.6607198 0.2808094 0.3697380
21 0.4454056 0.7733354 4.3433420 0.4391075 0.5503594 0.5893406 0.2262403 0.2361512
22 0.9583940 0.6365843 3.0033951 0.6507968 0.8610046 0.6363198 0.2866719 0.5736855
23 0.4969730 0.3895182 2.0021608 0.3354475 1.4398250 0.7386870 0.2458906 0.3414804
...
...

2 Answers:

Answer 0 (score: 4)

You can do this with some ugly computing on the language, using quote and plyr::.

Reading https://github.com/hadley/devtools/wiki/Computing-on-the-language may help you decide whether you really want to do this.
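To make the idea concrete, here is a minimal base R sketch of what "computing on the language" amounts to for this question: a string is turned into a call with parse() and then evaluated once per group, with the group's columns visible as variables. The helper name summarise_from_string and the demo data are assumptions made for illustration and are not part of plyr; the expressions refer to columns by bare name (mean(a)) rather than x$a.

```r
# Hypothetical helper (not part of plyr): evaluate a string of named
# expressions once per group, using only base R.
summarise_from_string <- function(df, group_col, expr_string) {
  caller <- parent.frame()
  # Wrapping in list(...) turns "am=mean(a), bm=mean(b)" into one parseable call
  expr <- parse(text = sprintf("list(%s)", expr_string))[[1]]
  groups <- split(df, df[[group_col]], drop = TRUE)
  pieces <- lapply(groups, function(part) {
    # Columns of `part` are found by name; anything else (e.g. a user-defined
    # function) is looked up starting from the caller's environment
    vals <- eval(expr, envir = part, enclos = caller)
    cbind(part[1, group_col, drop = FALSE], as.data.frame(vals))
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

# Small made-up data set for illustration
demo <- data.frame(a = c(1, 3, 5, 7), b = c(2, 4, 6, 8),
                   d = c("x", "x", "y", "y"))
func <- "am=mean(a), bm=mean(b)"
res <- summarise_from_string(demo, "d", func)
print(res)
```

Because the string is parsed rather than pasted into data.frame(), editing func is enough to add or change output columns.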

Anyway, one approach might be to

  1. Use .() to create the vector of arguments, e.g. in the style used with summarise

    .(am=mean(a), bm=mean(b), cm=mean(c))
    

    If you really want to use a character string

    foo<- "am=mean(a), bm=mean(b), cm=mean(c)"
    eval(parse(text = sprintf('.(%s)', foo )))
    
  2. Use quote liberally to create the list that is passed to do.call

  3. For example

    lapply(list.1, function(x) do.call(ddply,c(.data = quote(x), 
        .variables = quote(.(d)), .fun = quote(summarize),
          .(am=mean(a), bm=mean(b), cm=mean(c)))))
    

    Oh boy, is that ugly.

    Alternatively, you could use data.table

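    library(data.table)

    # convert each data frame in the list to a data.table
    listDT <- lapply(list.1, data.table)

    # mean of every non-grouping column, by d
    lapply(listDT, function(x) x[,lapply(.SD, mean), by = 'd'])

    # or evaluate the character string foo defined above
    mystuff <- sprintf('list(%s)', foo)
    lapply(listDT, function(x) x[, eval(parse(text = mystuff)), by = 'd'])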
    

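    However, if all of your data.tables have the same columns, it would be more efficient to create one big data.table (with an identifier for each element of the list) and work with that.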
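The single-table idea can be sketched roughly as follows, assuming the data.table package is installed and every list element has the same columns. The names list.demo, big, and the "source" identifier column are made up for this illustration; rbindlist() with idcol stacks the list and records which element each row came from.

```r
library(data.table)

# Small made-up list standing in for list.1
list.demo <- list(
  df1 = data.frame(a = c(1, 3), b = c(2, 4), d = c("x", "x")),
  df2 = data.frame(a = c(5, 7), b = c(6, 8), d = c("y", "y"))
)

# Stack everything into one data.table; "source" holds the list element name
big <- rbindlist(list.demo, idcol = "source")

# Same string-based trick as above, now grouped by source and d in one call
foo <- "am=mean(a), bm=mean(b)"
mystuff <- sprintf("list(%s)", foo)
res <- big[, eval(parse(text = mystuff)), by = .(source, d)]
print(res)
```

Grouping by both the identifier and d gives one row per original data frame and group, without an outer lapply.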

Answer 1 (score: 2)

Here is a ddply call that computes the mean of every column in the data frame other than d:

lapply(list.1,
       function(x) {
         ddply(
           x,
           .(d),
           function(df_part) {
             result_df <- data.frame(d=df_part$d[1])
             non_d_cols <- colnames(df_part)[! colnames(df_part) == "d"]
             for (col in non_d_cols) {
               col_mean <- mean(df_part[[col]])
               col_name <- paste0(col, "_mean")
               result_df[[col_name]] <- col_mean
             }
             return(result_df)
           })
       })

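In my view this is the simplest approach, and it should generalize well to other calculations you might want to run on these columns. You could perhaps pass in a character vector of the columns whose means you want, and use that in place of non_d_cols.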
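A minimal sketch of that last suggestion, assuming plyr is installed: the helper name group_means and the demo data are hypothetical, and cols is the character vector of column names to average. The grouping column does not need to be added by hand, because ddply attaches the splitting variable to whatever the inner function returns.

```r
library(plyr)

# Hypothetical helper: `cols` is a character vector of columns to average
group_means <- function(df, group_col, cols) {
  ddply(df, group_col, function(df_part) {
    # one mean per requested column, returned as a one-row data frame
    means <- lapply(df_part[cols], mean)
    names(means) <- paste0(cols, "_mean")
    as.data.frame(means)
  })
}

# Small made-up data set for illustration
demo <- data.frame(a = c(1, 3, 5, 7), b = c(2, 4, 6, 8),
                   d = c("x", "x", "y", "y"))
res <- group_means(demo, "d", c("a", "b"))
print(res)
```

To run it over the list from the question, the call would be wrapped as lapply(list.1, group_means, group_col = "d", cols = c("a", "b", "c")).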