R - Calculate differences by group across all data cuts

Time: 2019-05-13 19:26:13

Tags: r loops

I have a dataset with several attributes and one value.

Input (sample)

  GRP CAT TYP  VAL
    X   H   5 0.76
    X   A   2 0.34
    X   D   3 0.70
    X   I   3 0.33
    X   F   4 0.80
    X   E   1 0.39

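I want to:

  1. Identify all combinations of CAT and TYP
  2. For each combination, calculate the mean per group with that combination removed
  3. Return a final table of differences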
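
Step 1 can be sketched on a tiny table. This is only an illustration: `toy_attrs` is made-up data, not the question's data set. Adding NA to each column's values gives a "do not filter on this column" option, so the row with NA in both columns is the baseline where nothing is removed.

```r
# Made-up attribute columns for illustration only
toy_attrs <- data.frame(CAT = c("H", "A", "A", "D"),
                        TYP = c(5L, 2L, 3L, 3L),
                        stringsAsFactors = FALSE)

# NA acts as a wildcard: "do not filter on this column"
splits <- expand.grid(CAT = c(NA, unique(toy_attrs$CAT)),
                      TYP = c(NA, unique(toy_attrs$TYP)),
                      stringsAsFactors = FALSE)

nrow(splits)     # (3 + 1) * (3 + 1) = 16 combinations
head(splits, 2)  # row 1: CAT = NA, TYP = NA; row 2: CAT = "H", TYP = NA
```

Note that expand.grid also produces CAT/TYP pairs that never occur together in the data. Removing such a pair removes no records, so its difference equals the baseline.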

Final table (example)

   CAT TYP    DIFF
1 <NA>  NA 0.04000
2    H  NA 0.03206

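Row 1 shows that if no records are removed, the difference between the means of GRP='X' and GRP='Y' is 0.04. Row 2 shows that if records with CAT='H' are removed, the difference is 0.032.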
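
To make the two kinds of rows concrete, here is a small worked example in base R. The `toy` data frame and its values are invented for illustration, and the difference is taken as Y minus X, which is the order diff() yields in the code below when the groups are sorted alphabetically.

```r
# Invented data: three records per group
toy <- data.frame(GRP = c("X", "X", "X", "Y", "Y", "Y"),
                  CAT = c("H", "A", "A", "H", "A", "D"),
                  TYP = c(5L, 2L, 3L, 5L, 2L, 3L),
                  VAL = c(0.5, 0.4, 0.6, 0.5, 0.7, 0.9),
                  stringsAsFactors = FALSE)

# Baseline row (CAT = NA, TYP = NA): no records removed
means_all <- tapply(toy$VAL, toy$GRP, mean)
diff_all  <- unname(means_all["Y"] - means_all["X"])   # about 0.7 - 0.5 = 0.2

# Row for CAT = 'H': drop every record with CAT == "H", then recompute
kept       <- toy[toy$CAT != "H", ]
means_no_h <- tapply(kept$VAL, kept$GRP, mean)
diff_no_h  <- unname(means_no_h["Y"] - means_no_h["X"]) # about 0.8 - 0.5 = 0.3
```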

I have working code, but I would like to make it faster. I am open to suggestions.

Working code

library(dplyr)

set.seed(777)

# build example data frame
df <- data.frame(GRP = c(rep('X',25),rep('Y',25)),
                 CAT = sample(LETTERS[1:10], 50, T),
                 TYP = sample(1:5, 50, T),
                 VAL = sample(1:100, 50, T)/100,
                 stringsAsFactors = F)

# wrap the calculation in a function so it can be timed below;
# it reads the global `df` built above
fcnDiffCalc <- function() {

  # table of all combinations of CAT and TYP (NA means "do not filter on this column")
  splits <- expand.grid(lapply(df[,-c(1,4)], function(x) c(NA, unique(x))), stringsAsFactors = F)

  # null data frame to store results
  ans <- data.frame(CAT = character(),
                    TYP = integer(),
                    DIFF = numeric(),
                    stringsAsFactors = F)

  # loop through each combination and calculate the difference between group X and Y
  for(i in 1:nrow(splits)) {

    split.i <- splits[i,]

    # determine non-na columns
    by.cols <- colnames(split.i)[unlist(lapply(split.i, function(x) !all(is.na(x))))]

    # anti-join to remove records that match `split.i`
    if(length(by.cols) > 0){
      df.i <- df %>%
        anti_join(split.i, by = by.cols)
    } else {
      df.i <- df
    }

    # calculate average by group
    df.i <- df.i %>%
      group_by(GRP) %>%
      summarize(VAL_MEAN = mean(VAL))

    # calculate difference of averages (Y minus X)
    DIFF <- df.i[,2] %>%
      as.matrix() %>%
      diff() %>%
      as.numeric()

    ans.tmp <- cbind(split.i, DIFF)

    # bind to final data frame
    ans <- bind_rows(ans, ans.tmp)

  }
  return(ans)
}

Timing results

> system.time(fcnDiffCalc())
   user  system elapsed 
   0.30    0.02    0.31 

1 answer:

Answer 0 (score: 1)

Consider assigning the DIFF column with sapply instead of growing a data frame inside the loop, which avoids repeated in-memory copying:

fcnDiffCalc2 <- function() {
  # table of all combinations of CAT and TYP
  splits <- data.frame(expand.grid(lapply(df[,-c(1,4)], function(x) c(NA, unique(x))), 
                       stringsAsFactors = F))

  # loop through each combination and calculate the difference between group X and Y
  splits$DIFF <- sapply(1:nrow(splits), function(i) {

    split.i <- splits[i,]

    # determine non-na columns
    by.cols <- colnames(split.i)[unlist(lapply(split.i, function(x) !all(is.na(x))))]

    # anti-join to remove records that match `split.i`
    df.i <- tryCatch(df %>%
        anti_join(split.i, by = by.cols), error = function(e) df)

    # calculate average by group
    df.i <- df.i %>%
      group_by(GRP) %>%
      summarize(VAL_MEAN = mean(VAL))

    # calculate difference of averages
    DIFF <- df.i[,2] %>%
      as.matrix() %>%
      diff() %>%
      as.numeric()
  })

  return(splits)
}
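
The difference between the two patterns can be seen on a deliberately trivial computation. The sketch below is not part of the answer: `grow` and `fill` are hypothetical helper names, and the timings depend on the machine, so none are quoted here.

```r
# Pattern 1: grow a data frame one row at a time.
# Every rbind() call builds a new data frame from all rows collected so far.
grow <- function(n) {
  out <- data.frame(i = integer(), sq = numeric())
  for (i in seq_len(n)) {
    out <- rbind(out, data.frame(i = i, sq = i^2))
  }
  out
}

# Pattern 2: compute the whole column in one call and attach it once.
fill <- function(n) {
  out <- data.frame(i = seq_len(n))
  out$sq <- vapply(out$i, function(i) i^2, numeric(1))
  out
}

# Both patterns produce the same values; compare their run times locally
same_values <- identical(grow(200)$sq, fill(200)$sq)
system.time(grow(200))
system.time(fill(200))
```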

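Better still, avoid the lapply loop inside expand.grid, prefer vapply over sapply (or over unlist + lapply, which amounts to sapply) because vapply declares the structure of the result, and avoid pipes inside the loop by reverting to base R's aggregate: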

fcnDiffCalc3 <- function() {
  # table of all combinations of CAT and TYP
  splits <- data.frame(expand.grid(CAT = c(NA, unique(df$CAT)), TYP = c(NA, unique(df$TYP)),
                                   stringsAsFactors = FALSE))

  # loop through each combination and calculate the difference between group X and Y
  splits$DIFF <- vapply(1:nrow(splits), function(i) {

    split.i <- splits[i,]

    # determine non-na columns
    by.cols <- colnames(split.i)[vapply(split.i, function(x) !all(is.na(x)), logical(1))]

    # anti-join to remove records that match `split.i`
    df.i <- tryCatch(anti_join(df, split.i, by = by.cols), error = function(e) df)

    # calculate average by group
    df.i <- aggregate(VAL ~ GRP, df.i, mean)

    # calculate difference of averages
    diff(df.i$VAL)

  }, numeric(1))

  return(splits)
}
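
Going one step further, the join and the aggregation can both be replaced by logical indexing. The following is a minimal sketch rather than part of the answer: `fcnDiffCalcBase` is a hypothetical name, it takes the data frame as an argument, and it assumes GRP contains exactly the two groups 'X' and 'Y'. It handles the "nothing removed" row explicitly instead of relying on tryCatch. Whether it is actually faster should be checked with system.time on your own data.

```r
set.seed(777)

# Same example data as in the question
df <- data.frame(GRP = c(rep("X", 25), rep("Y", 25)),
                 CAT = sample(LETTERS[1:10], 50, TRUE),
                 TYP = sample(1:5, 50, TRUE),
                 VAL = sample(1:100, 50, TRUE) / 100,
                 stringsAsFactors = FALSE)

# Hypothetical base R variant; assumes GRP holds exactly "X" and "Y"
fcnDiffCalcBase <- function(df) {
  splits <- expand.grid(CAT = c(NA, unique(df$CAT)),
                        TYP = c(NA, unique(df$TYP)),
                        stringsAsFactors = FALSE)
  n    <- nrow(df)
  is_y <- df$GRP == "Y"

  splits$DIFF <- vapply(seq_len(nrow(splits)), function(i) {
    cat_i <- splits$CAT[i]
    typ_i <- splits$TYP[i]

    # Start from "drop everything" unless both values are NA (baseline row),
    # then narrow the mask by each non-NA attribute
    drop <- rep(!(is.na(cat_i) && is.na(typ_i)), n)
    if (!is.na(cat_i)) drop <- drop & df$CAT == cat_i
    if (!is.na(typ_i)) drop <- drop & df$TYP == typ_i
    keep <- !drop

    # Difference of group means, Y minus X; NaN if a group has no rows left
    mean(df$VAL[keep & is_y]) - mean(df$VAL[keep & !is_y])
  }, numeric(1))

  splits
}

res <- fcnDiffCalcBase(df)
head(res)
```

If a combination removes every record of one group, this version returns NaN for that row, which is a design choice of the sketch and may differ from how the join-based versions behave in that case.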