How can I write a function in R that takes a data.frame as input and whose row-by-row output is conditional on the contents?

Asked: 2016-05-11 02:18:39

Tags: r function dataframe

Consider the following table:

  V1 V2         V3         V4
1  A  X -0.2834111 -1.5095923
2  A  X  0.3114088 -0.1706417
3  B  Y  0.2544403 -0.4790589
4  B  X  0.6209947 -1.8988974
5  C  X  1.7428690 -0.2251725

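I want to write a function that produces a computed value for each row, where the computation depends on the contents of several variables in that row. For example: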

If V1 = A, Output f(V3,V4)
If V1 = B, Output g(V3,V4)
If V1 = C, Output 0
If V1 = B AND V2 = Y, Output h(V3,V4)

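Here f, g, and h are suitably vectorized functions. What is the best way to write a function that produces an output vector computed by a set of functions, where the choice of function depends on rules applied to the contents of the data.frame's columns?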

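Right now I have a wrapper function that takes a data.frame as input and passes the required columns to a main function, which calls the sub-functions according to the conditions.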

For example:

foo_wrapper <- function(x){
    foo(x$V1, x$V2, x$V3, x$V4)
}

The main function is:

foo <- function(V1,V2,V3,V4){

#Define Functions
f <- function() ...  (some vectorized function)
g <- function() ...
h <- function() ...

#Produce results
res <- NA

res <- ifelse(V1 == "A", f(V3,V4), res)
res <- ifelse(V1 == "C", 0, res)
res <- ifelse(V1 == "B" & V2 != "Y", g(V3,V4), res)
res <- ifelse(V1 == "B" & V2 == "Y", h(V3,V4), res)

return(res)
}

This is slow, and I'm sure there is a better way.

Any insight would be greatly appreciated.

Edit: Assume f, g, and h are:

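f <- function(V3, V4){
    V3*V4
}

g <- function(V3, V4){
    pmax(V3,V4)
}

h <- function(V3, V4){
    # NOTE: y is not defined anywhere in the post; it must exist in the calling environment
    exp(-1*V3)/(y+V4)
}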

4 Answers:

Answer 0: (score: 2)

Here is one possible optimization, though it's hard to know how much it helps without more real data.

my_df <- read.table(header=TRUE, text=
"V1 V2         V3         V4
A  X -0.2834111 -1.5095923
A  X  0.3114088 -0.1706417
B  Y  0.2544403 -0.4790589
B  X  0.6209947 -1.8988974
C  X  1.7428690 -0.2251725")

## define functions outside the foo function - perhaps continual redefinition is slow
## use paste as a fake definition for testing
f <- function(x,y) {paste("f",x,y)} 
g <- function(x,y) {paste("g",x,y)} 
h <- function(x,y) {paste("h",x,y)} 

# define the function to be applied to each row
foo <- function(item){

  #Produce results, nested ifelse avoids reevaluation 
  res <- ifelse(item['V1'] == "A", f(item['V1'],item['V2']), 
           ifelse(item['V1'] == "C", 0, 
             ifelse(item['V1'] == "B" & item['V2'] != "Y", g(item['V3'],item['V4']), 
               ifelse(item['V1'] == "B" & item['V2'] == "Y", h(item['V3'],item['V4']), 
                      NA))))

  return(res)
}


apply(my_df, 1, foo)

[1] "f A X"                   "f A X"                   "h  0.2544403 -0.4790589" "g  0.6209947 -1.8988974"
[5] "0"                      
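
One detail worth knowing when adapting this: apply() first converts the data frame to a matrix, and with mixed column types that matrix is character, so each `item` arrives as a character vector. That is fine for the paste() placeholders above, but numeric versions of f, g, and h would need as.numeric() first. A small sketch on the same sample data (row_types and v3_back are hypothetical names introduced here):

```r
# Sample data from the question
my_df <- data.frame(
  V1 = c("A", "A", "B", "B", "C"),
  V2 = c("X", "X", "Y", "X", "X"),
  V3 = c(-0.2834111, 0.3114088, 0.2544403, 0.6209947, 1.7428690),
  V4 = c(-1.5095923, -0.1706417, -0.4790589, -1.8988974, -0.2251725),
  stringsAsFactors = FALSE
)

# Inspect the class of the vector that apply() hands to the function for each row
row_types <- apply(my_df, 1, function(item) class(item))

# Numeric columns come through as text and have to be converted back
v3_back <- as.numeric(apply(my_df, 1, function(item) item[["V3"]]))
```

Here row_types should be "character" for every row, and v3_back should recover the V3 column only after the explicit conversion.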

Answer 1: (score: 2)

The ifelse() function is not known for being fast. Direct indexing is usually faster:

foo <- function(V1,V2,V3,V4){

    #Define Functions
    f <- function(x, y) paste(x,y)
    g <- function(x, y) pmax(x,y)
    h <- function(x, y) exp(-1*x)/(y+4)

    #Produce results
    res <- rep(0, length(V1))

    idx <- V1 == "A"
    res[idx] <- f(V1[idx],V2[idx])
    idx <- V1 == "B" & V2 != "Y"
    res[idx] <- g(V3[idx],V4[idx])
    idx <- V1 == "B" & V2 == "Y"
    res[idx] <- h(V3[idx],V4[idx])

    return(res)
}

This should minimize the number of computations.
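
As a rough usage sketch of the indexing approach on the question's sample data: the two-argument f, g, and h below are assumed stand-ins (the question's versions take no arguments), the constant 4 in h follows this answer's code, and foo_indexed is a hypothetical name. which() is used so that any NA comparisons simply drop out of the index.

```r
# Sample data from the question
my_df <- data.frame(
  V1 = c("A", "A", "B", "B", "C"),
  V2 = c("X", "X", "Y", "X", "X"),
  V3 = c(-0.2834111, 0.3114088, 0.2544403, 0.6209947, 1.7428690),
  V4 = c(-1.5095923, -0.1706417, -0.4790589, -1.8988974, -0.2251725),
  stringsAsFactors = FALSE
)

# Assumed two-argument, vectorized versions of the question's functions
f <- function(x, y) x * y
g <- function(x, y) pmax(x, y)
h <- function(x, y) exp(-1 * x) / (y + 4)  # the 4 is taken from this answer; an assumption

foo_indexed <- function(V1, V2, V3, V4) {
  res <- rep(NA_real_, length(V1))  # NA marks rows that match no rule

  idx <- which(V1 == "A")
  res[idx] <- f(V3[idx], V4[idx])

  idx <- which(V1 == "C")
  res[idx] <- 0

  idx <- which(V1 == "B" & V2 != "Y")
  res[idx] <- g(V3[idx], V4[idx])

  idx <- which(V1 == "B" & V2 == "Y")
  res[idx] <- h(V3[idx], V4[idx])

  res
}

out <- foo_indexed(my_df$V1, my_df$V2, my_df$V3, my_df$V4)
```

Each helper is called once, and only on the rows that need it, which is the point of indexing instead of ifelse().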

Answer 2: (score: 2)

You should also consider this:

Assumption: df is the data frame under consideration.

library(data.table)

setDT(df)

test <- function(x){
    if (x$V1[1] == 'A')
        return (f(x$V3,x$V4))
    else if (x$V1[1] == 'C')
        return (rep(0,nrow(x)))
    else if (x$V1[1] == 'B' && x$V2[1] == 'Y')
        return (h(x$V3,x$V4))
    else
        return (g(x$V3,x$V4))
}

df[,test(.SD),by=c('V1','V2'),.SDcols = colnames(df)]

Answer 3: (score: 1)

For some reason, I'm in a very explicit and human-readable mood today. Here is my solution:

## data
df <- data.frame(V1=c('A','A','B','B','C'),V2=c('X','X','Y','X','X'),V3=c(-0.2834111,0.3114088,0.2544403,0.6209947,1.7428690),V4=c(-1.5095923,-0.1706417,-0.4790589,-1.8988974,-0.2251725),stringsAsFactors=F);

## map of functions
funs <- list(
    zero=function(x,y) 0,
    mult=function(x,y) x*y,
    exp=function(x,y) exp(-1*x)/y,
    pmax=function(x,y) pmax(x,y)
);

## encapsulate logic that transforms V1,V2 space to function space
vgrp.to.fungrp <- function(V1,V2)
    ifelse(V1=='A','mult',
        ifelse(V1=='C','zero',
            ifelse(V1=='B',
                ifelse(V2=='Y','exp','pmax'),
                'error'
            )
        )
    );

## run it to get function grouping
fungrps <- vgrp.to.fungrp(df$V1,df$V2);
fungrps;
## [1] "mult" "mult" "exp"  "pmax" "zero"

## use ave() to run each represented function once for the set of rows that map to it
ave(seq_len(nrow(df)),fungrps,FUN=function(ri) funs[[fungrps[ri[1L]]]](df$V3[ri],df$V4[ri]));
## [1]  0.42783521 -0.05313933 -1.61848645  0.62099470  0.00000000
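
The ave() call can also be spelled out with split() and a loop, which may be easier to step through. This is only a sketch of the same idea; it repeats df, funs, and the fungrps result from above so that it runs on its own, and rows_by_fun and res are names introduced here.

```r
## data and function map, repeated from above
df <- data.frame(
  V1 = c("A", "A", "B", "B", "C"),
  V2 = c("X", "X", "Y", "X", "X"),
  V3 = c(-0.2834111, 0.3114088, 0.2544403, 0.6209947, 1.7428690),
  V4 = c(-1.5095923, -0.1706417, -0.4790589, -1.8988974, -0.2251725),
  stringsAsFactors = FALSE
)
funs <- list(
  zero = function(x, y) 0,
  mult = function(x, y) x * y,
  exp  = function(x, y) exp(-1 * x) / y,
  pmax = function(x, y) pmax(x, y)
)

## function name per row, as returned by vgrp.to.fungrp() above
fungrps <- c("mult", "mult", "exp", "pmax", "zero")

## row indices grouped by the function that applies to them
rows_by_fun <- split(seq_len(nrow(df)), fungrps)

## call each function once, on just its rows, and write results back in place
res <- numeric(nrow(df))
for (nm in names(rows_by_fun)) {
  ri <- rows_by_fun[[nm]]
  res[ri] <- funs[[nm]](df$V3[ri], df$V4[ri])
}
res
```

The result should match the ave() output above, since both versions call each function once on the same set of rows and keep the original row order.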