Expand a data frame from the minimum to the maximum value of each column

Time: 2017-07-03 01:03:54

Tags: r dplyr

The reproducible data below contain random values for 2 covariates (cov1 and cov2), 2 animals (Cat and Dog), and 2 seasons (Summer and Winter).

library(dplyr); library(tidyr)
set.seed(123)
dat <- data.frame(Season = rep(c("Summer", "Winter"), each = 100),
                  Species = rep(c("Cat", "Dog", "Cat", "Dog"), each = 50),
                  cov1 = sample(1:100, 200, replace = TRUE),
                  cov2 = sample(1:100, 200, replace = TRUE))

head(dat)
  Season Species cov1 cov2
1 Summer     Cat   29   24
2 Summer     Cat   79   97
3 Summer     Cat   41   61
4 Summer     Cat   89   52
5 Summer     Cat   95   41
6 Summer     Cat    5   89

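I would like to create a new df that contains the sequence from the minimum to the maximum value for each Season / Species combination. My initial idea was to first use dplyr to identify the minimum and maximum values.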

RangeDat <- dat %>% group_by(Season, Species) %>% 
  summarise_each(funs(min, max)) %>%
  as.data.frame()

> RangeDat
  Season Species cov1_min cov2_min cov1_max cov2_max
1 Summer     Cat        3        5      100       97
2 Summer     Dog        1        1       99       99
3 Winter     Cat        2        1       99      100
4 Winter     Dog       12        2       99      100
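
In recent dplyr releases summarise_each() and funs() are deprecated, so the same summary may need to be written with across(). The following is only a sketch, assuming dplyr >= 1.0.0; toy and range_dat are hypothetical names and the data are made up for illustration, not taken from dat. Note that across() orders the result columns by covariate (cov1_min, cov1_max, cov2_min, cov2_max) rather than by function as in the output above.

```r
library(dplyr)  # assumes dplyr >= 1.0.0, where across() is available

# Small hypothetical data set with the same column layout as dat
toy <- data.frame(Season  = c("Summer", "Summer", "Winter", "Winter"),
                  Species = c("Cat", "Cat", "Dog", "Dog"),
                  cov1    = c(1L, 3L, 5L, 6L),
                  cov2    = c(2L, 6L, 1L, 2L))

# Minimum and maximum of each covariate per Season / Species group
range_dat <- toy %>%
  group_by(Season, Species) %>%
  summarise(across(c(cov1, cov2), list(min = min, max = max)),
            .groups = "drop") %>%
  as.data.frame()

range_dat
```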

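From here I am not sure how to expand the df. Ideally, the resulting df would have 4 columns (Season, Species, cov1, cov2). The values of cov1 and cov2 would range from the minimum to the maximum for each Season / Species combination. As in the original dat df, the values of Season and Species would be repeated down the df alongside the increasing values of cov1 and cov2.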

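Following up on the comments: is it possible to include NA values where the sequence for a Species / Season combination is shorter than the "maximum" range?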

Any suggestions would be much appreciated!

1 answer:

Answer 0: (score: 5)

We can create a list column within summarise:

library(dplyr)
dat %>%
    group_by(Season, Species) %>% 
    summarise(cov1 = list(min(cov1):max(cov1)), cov2 = list(min(cov2):max(cov2)))

Or with data.table:

library(data.table)
setDT(dat)[, .(cov1 = list(min(cov1):max(cov1)),
               cov2 = list(min(cov2):max(cov2))), by = .(Season, Species)]

Update

Since the OP mentioned keeping the length the same by padding with NA, one option with dplyr would be

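f1 <- function(x1, x2) {
  # Full integer sequence from the minimum to the maximum of each covariate
  x1 <- min(x1):max(x1)
  x2 <- min(x2):max(x2)
  # Pad the shorter sequence with NA so both have the same length
  m1 <- max(length(x1), length(x2))
  length(x1) <- m1
  length(x2) <- m1
  list(cov1 = x1, cov2 = x2)
}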
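
The padding in f1 relies on base R's length<- replacement function, which extends an atomic vector with NA when the new length is larger than the current one. A minimal illustration, with made-up values (short is a hypothetical name):

```r
# A sequence of three integers
short <- 3:5

# Assigning a larger length pads the vector with NA at the end
length(short) <- 5

short
```

The vector keeps its integer type, so the padded columns can still be combined with the unpadded ones.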

dat %>%
    group_by(Season, Species) %>% 
    do(data.frame(Season = .$Season[1], Species = .$Species[1],  f1(.$cov1, .$cov2)))
# A tibble: 396 x 4
# Groups:   Season, Species [4]
#   Season Species  cov1  cov2
#   <fctr>  <fctr> <int> <int>
# 1 Summer     Cat     3     5
# 2 Summer     Cat     4     6
# 3 Summer     Cat     5     7
# 4 Summer     Cat     6     8
# 5 Summer     Cat     7     9
# 6 Summer     Cat     8    10
# 7 Summer     Cat     9    11
# 8 Summer     Cat    10    12
# 9 Summer     Cat    11    13
#10 Summer     Cat    12    14
# ... with 386 more rows

A possible extension with data.table would be

setDT(dat)[, f1(cov1, cov2), .(Season, Species)]
#     Season Species cov1 cov2
#  1: Summer     Cat    3    5
#  2: Summer     Cat    4    6
#  3: Summer     Cat    5    7
#  4: Summer     Cat    6    8
#  5: Summer     Cat    7    9
# ---                         
#392: Winter     Dog   NA   96
#393: Winter     Dog   NA   97
#394: Winter     Dog   NA   98
#395: Winter     Dog   NA   99
#396: Winter     Dog   NA  100
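
For readers who prefer not to depend on dplyr or data.table, the same idea can be sketched in base R with split(), lapply() and rbind(). This is only an illustration under assumptions: toy, pad_ranges and expanded are hypothetical names, and the data are made up rather than taken from dat.

```r
# Small hypothetical data set with the same column layout as dat
toy <- data.frame(Season  = c("Summer", "Summer", "Winter", "Winter"),
                  Species = c("Cat", "Cat", "Dog", "Dog"),
                  cov1    = c(1L, 3L, 5L, 6L),
                  cov2    = c(2L, 6L, 1L, 2L))

# Same logic as f1, but returning a data frame
pad_ranges <- function(x1, x2) {
  x1 <- min(x1):max(x1)
  x2 <- min(x2):max(x2)
  n <- max(length(x1), length(x2))
  length(x1) <- n  # pads with NA if x1 is the shorter sequence
  length(x2) <- n  # pads with NA if x2 is the shorter sequence
  data.frame(cov1 = x1, cov2 = x2)
}

# One data frame per Season / Species combination that occurs in the data
groups <- split(toy, list(toy$Season, toy$Species), drop = TRUE)

# Expand each group and stack the pieces back together
expanded <- do.call(rbind, lapply(groups, function(g) {
  cbind(Season = g$Season[1], Species = g$Species[1],
        pad_ranges(g$cov1, g$cov2))
}))
rownames(expanded) <- NULL

expanded
```

In this toy data the Summer / Cat group has a cov1 range of 1 to 3 and a cov2 range of 2 to 6, so cov1 is padded with two NA values, while the Winter / Dog group needs no padding.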