dplyr summarise() with a special function for edge cases

Date: 2016-09-19 20:35:06

Tags: r dplyr

I want to do a grouped summarise() operation in dplyr, but apply a different function when an edge case is met.

My data looks like this. The concentration and standard deviation are calculated as follows:

library(dplyr)
# tibble() replaces data_frame(), which is deprecated in current dplyr/tibble
testdata <- tibble(sample = sort(rep(1:3, 4)),
                   volume = rep(c(1e-1, 1e-1, 1e-2, 1e-2), 3),
                   count = c(400, 400, 40, 40, 0, 0, 0, 0, 400, 400, 400, 400))

testdata %>% 
  group_by(sample) %>% 
  summarise(concentration = sum(count) / sum(volume),
            sd   = sqrt(sum(count)))

However, only counts between 25 and 250 should be used in the calculation. I can implement that with:

testdata %>% 
  group_by(sample) %>%
  filter((count >= 25) & (count <= 250)) %>%
  summarise(concentration = sum(count) / sum(volume),
            sd   = sqrt(sum(count)))

But then samples 2 and 3 get no concentration at all, because none of their rows survive the filter.
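To see why those groups vanish, a minimal sketch (the column name `n_in_range` is introduced here for illustration only) counts how many rows per sample fall inside the 25-250 window. Groups with zero rows in range are the ones `filter()` removes entirely, so `summarise()` never sees them.

```r
library(dplyr)

testdata <- tibble(sample = sort(rep(1:3, 4)),
                   volume = rep(c(1e-1, 1e-1, 1e-2, 1e-2), 3),
                   count = c(400, 400, 40, 40, 0, 0, 0, 0, 400, 400, 400, 400))

# Count the rows per group that fall inside the accepted range.
# n_in_range is a hypothetical helper column, not part of the original code.
in_range <- testdata %>%
  group_by(sample) %>%
  summarise(n_in_range = sum(count >= 25 & count <= 250))

print(in_range)
```

Sample 1 keeps its two rows with a count of 40, while samples 2 and 3 keep none.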

The edge cases for each group might be calculated like this (pseudocode):

if (all(count <= 25)){
  summarise(concentration = 25 / min(volume),
            sd = NA)
} else if (all(count >= 250)){
  summarise(concentration = 250 / max(volume),
            sd = NA)
}

Can these edge cases be integrated into the summarise() call?

Ideally, I would also like a flag column that returns result = "OK" in every case except the edge cases, which would return:

if (all(count <= 25)){
  summarise(concentration = 25 / min(volume),
            sd = NA,
            result = "LOW")
} else if (all(count >= 250)){
  summarise(concentration = 250 / max(volume),
            sd = NA,
            result = "HIGH")
}

1 answer:

Answer 0: (score: 0)

One approach is to encode your logic inside summarise using ifelse:

library(dplyr)
result <- testdata %>% group_by(sample) %>%
                       summarise(concentration = ifelse(all(count <= 25), 
                                                        25 / min(volume),
                                                        ifelse(all(count >= 250), 
                                                               250 / max(volume), 
                                                               sum(count) / sum(volume))),
                                 sd = ifelse(all(count <= 25), 
                                             NA, 
                                             ifelse(all(count >= 250), 
                                                    NA, 
                                                    sqrt(sum(count)))),
                                 result = ifelse(all(count <= 25), 
                                                 "LOW", 
                                                 ifelse(all(count >= 250),
                                                        "HIGH", 
                                                        "OK")))
print(result)
### A tibble: 3 x 4
##  sample concentration       sd result
##   <int>         <dbl>    <dbl>  <chr>
##1      1          4000 29.66479     OK
##2      2          2500       NA    LOW
##3      3          2500       NA   HIGH
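The nested ifelse calls above test the same two conditions three times. Because each condition is a single TRUE/FALSE per group, a plain `if`/`else` is enough. The sketch below is one possible variant, not the answer author's code: it computes the flag first and then reuses it, relying on the fact that later expressions in `summarise()` can refer to summary columns created earlier in the same call. Using `NA_real_` keeps `sd` a double in every group.

```r
library(dplyr)

testdata <- tibble(sample = sort(rep(1:3, 4)),
                   volume = rep(c(1e-1, 1e-1, 1e-2, 1e-2), 3),
                   count = c(400, 400, 40, 40, 0, 0, 0, 0, 400, 400, 400, 400))

result <- testdata %>%
  group_by(sample) %>%
  summarise(
    # Classify the group once; each condition is a single logical value.
    result = if (all(count <= 25)) "LOW" else if (all(count >= 250)) "HIGH" else "OK",
    # switch() picks the formula by flag; the unnamed last argument is the default ("OK").
    concentration = switch(result,
                           LOW  = 25 / min(volume),
                           HIGH = 250 / max(volume),
                           sum(count) / sum(volume)),
    # sd is only defined for the regular case.
    sd = if (result == "OK") sqrt(sum(count)) else NA_real_
  ) %>%
  # Restore the column order used in the answer.
  select(sample, concentration, sd, result)

print(result)
```

This should give the same three rows as the ifelse version, with the conditions written only once.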

Updated approach

Another approach, hopefully closer to what the OP asked for, is to define a function:

summarise.func <- function(count, volume) {
  if (all(count <= 25)) {
    concentration <- 25 / min(volume)
    sd <- NA
    result <- "LOW"
  } else if (all(count >= 250)) {
    concentration <- 250 / max(volume)
    sd <- NA
    result <- "HIGH"
  } else {
    concentration <- sum(count) / sum(volume)
    sd <- sqrt(sum(count))
    result <- "OK"
  }
  data.frame(concentration=concentration, sd=sd, result=result, stringsAsFactors=FALSE)
}

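This function handles both the regular case and the edge cases. The key is that it returns a data.frame containing the summary results. summarise then creates a list column holding these data frames, which can be expanded with tidyr::unnest: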

library(dplyr)
library(tidyr)
result <- testdata %>% group_by(sample) %>%
                       summarise(csr=list(summarise.func(count, volume))) %>%
                       unnest(csr)
print(result)
### A tibble: 3 x 4
##  sample concentration       sd result
##   <int>         <dbl>    <dbl>  <chr>
##1      1          4000 29.66479     OK
##2      2          2500       NA    LOW
##3      3          2500       NA   HIGH
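In newer dplyr versions (1.0.0 and later, which postdate this answer), the list column and unnest step can be skipped: when an unnamed expression in `summarise()` returns a one-row data frame, its columns are unpacked into the result directly. A minimal sketch under that version assumption follows; `summarise_group` is a hypothetical name used here, with the same logic as the function above.

```r
library(dplyr)

testdata <- tibble(sample = sort(rep(1:3, 4)),
                   volume = rep(c(1e-1, 1e-1, 1e-2, 1e-2), 3),
                   count = c(400, 400, 40, 40, 0, 0, 0, 0, 400, 400, 400, 400))

# Hypothetical helper: returns a one-row tibble per group.
summarise_group <- function(count, volume) {
  if (all(count <= 25)) {
    tibble(concentration = 25 / min(volume), sd = NA_real_, result = "LOW")
  } else if (all(count >= 250)) {
    tibble(concentration = 250 / max(volume), sd = NA_real_, result = "HIGH")
  } else {
    tibble(concentration = sum(count) / sum(volume),
           sd = sqrt(sum(count)),
           result = "OK")
  }
}

# Assumes dplyr >= 1.0.0: the unnamed one-row data frame is unpacked into columns.
result <- testdata %>%
  group_by(sample) %>%
  summarise(summarise_group(count, volume))

print(result)
```

On older dplyr versions, the list column plus unnest approach shown above is still needed.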