Using a tidyverse approach

Date: 2017-06-13 22:01:05

Tags: r nested tidyverse tibble

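I have a data.frame whose rows I want to group into overlapping "batches", and then purrr::map these batches to a function. In the example below, d is the data.frame I want to group and batch: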

set.seed(19)
n1 <- data.frame(c0= "N",c1 = rep("A",4),c2 = rep(c("i","j"),2), num = rnorm(4))
n2 <- data.frame(c0= "N", c1 = rep("B",6),c2 = rep(c("i","j"),3), num = rnorm(3))
y1 <- data.frame(c0 = "Y", c1 = rep("A",2),c2 = c("i","j"), num = rnorm(2))
y2 <- data.frame(c0 = "Y", c1 = rep("B",4),c2 = rep(c("i","j"),each = 2), num = rnorm(2))

d <- rbind(y1,y2,n1,n2)

Here is d:

#   c0 c1  c2      num
# 1  Y  A  i -0.7447795
# 2  Y  A  j -0.2597870
# 3  Y  B  i -0.1830838
# 4  Y  B  i  0.5186300
# 5  Y  B  j -0.1830838
# 6  Y  B  j  0.5186300
# 7  N  A  i -1.1894537
# 8  N  A  j  0.3885812
# 9  N  A  i -0.3443333
# 10 N  A  j -0.5478961
# 11 N  B  i  0.9806622
# 12 N  B  j -0.2366460
# 13 N  B  i  0.8097397
# 14 N  B  j  0.9806622
# 15 N  B  i -0.2366460
# 16 N  B  j  0.8097397

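The subsetting recipe is:

  1. Subset by c0 -> gives groups Y and N
  2. Within the c0=="N" subset, subset by c1 -> gives groups NA and NB
  3. Subset each of NA and NB by c2 -> gives groups NAi, NAj, NBi, NBj
  4. row-bind N?i with Y?i and N?j with Y?j (where ? is A or B for the N rows, and any c1 value for the Y rows) -> gives the final 4 data subsets
  5. In R: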

    library(dplyr)  # provides %>% and filter()

    # Y rows for each c2 value, shared by the A and B batches
    subset.Yi <- d %>% filter(c0 == "Y" & c2 == "i")
    subset.Yj <- d %>% filter(c0 == "Y" & c2 == "j")

    # One batch per (c1, c2) pair among the N rows, with the matching Y rows appended
    list(
      d1 = d %>% filter(c0 == "N" & c1 == "A", c2 == "i") %>% rbind(subset.Yi),
      d2 = d %>% filter(c0 == "N" & c1 == "B", c2 == "i") %>% rbind(subset.Yi),
      d3 = d %>% filter(c0 == "N" & c1 == "A", c2 == "j") %>% rbind(subset.Yj),
      d4 = d %>% filter(c0 == "N" & c1 == "B", c2 == "j") %>% rbind(subset.Yj)
    ) %>% 
    tibble::tibble(batches = paste0("batch", seq_along(.)), data = .) -> tmp
    

    If matching on c2 didn't matter, I could do this:

    d %>% filter(c0 == "N") %>% 
      group_by(c1) %>% 
        do(batches = rbind(d[d$c0 == "Y", ], .)) -> tmp
    

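    But that isn't quite what I need. Thanks in advance! By the way, I know this is doable outside the tidyverse, but since I'm adopting the tidyverse approach for the rest of my code, I'd like to stay consistent.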

1 answer:

Answer 0 (score: 0)

Here is a solution that works for this case (that said, it would be great to see other, possibly more general, approaches from others).

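library(tidyverse)  # dplyr, tidyr, purrr

tmp <- d %>% 
  group_by(c2) %>% 
  nest(.key = c2) %>%   # one nested data frame per c2 value (older tidyr interface)
  mutate(c2 = map(c2, ~ .x %>% 
                    filter(., c0 == "N") %>% 
                    group_by(., c1) %>% 
                    # per c1 group: the Y rows of this c2 value plus the group's N rows
                    do(batches = bind_rows(
                      .x %>% filter(., c0 == "Y") %>% select(-c1),
                      (.) %>% select(-c1)))
                  ))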

tmp will then contain the four subsets. I can then do something like

tmp %>% unnest(c2) %>% .$batches %>% map(.,~sum(.$num)) %>% unlist

to get the sum of the num column in each of the 4 subsets:

[1] -1.94302047  1.14452254 -0.08355576  1.62951506
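
As a cross-check on the four sums above, here is a minimal sketch that builds each batch directly from its (c1, c2) key with purrr::map2 and sums num per batch. The names keys, build_batch and batch_sums are illustrative and not from the original post, and stringsAsFactors = FALSE is an added assumption so the key columns stay character on older R versions. If the random draws match the printout of d, the sums should agree with the output above.

```r
library(dplyr)
library(purrr)

# Rebuild the example data from the question (stringsAsFactors = FALSE is an assumption).
set.seed(19)
n1 <- data.frame(c0 = "N", c1 = rep("A", 4), c2 = rep(c("i", "j"), 2),
                 num = rnorm(4), stringsAsFactors = FALSE)
n2 <- data.frame(c0 = "N", c1 = rep("B", 6), c2 = rep(c("i", "j"), 3),
                 num = rnorm(3), stringsAsFactors = FALSE)
y1 <- data.frame(c0 = "Y", c1 = rep("A", 2), c2 = c("i", "j"),
                 num = rnorm(2), stringsAsFactors = FALSE)
y2 <- data.frame(c0 = "Y", c1 = rep("B", 4), c2 = rep(c("i", "j"), each = 2),
                 num = rnorm(2), stringsAsFactors = FALSE)
d <- rbind(y1, y2, n1, n2)

# One row per batch key: every (c1, c2) combination among the N rows,
# ordered by c2 then c1 to match the order of the sums above.
keys <- d %>%
  filter(c0 == "N") %>%
  distinct(c1, c2) %>%
  arrange(c2, c1)

# build_batch() is a hypothetical helper: the N rows for one (c1, c2) key
# plus all Y rows with the same c2, whatever their c1.
build_batch <- function(k1, k2) {
  bind_rows(
    filter(d, c0 == "N", c1 == k1, c2 == k2),
    filter(d, c0 == "Y", c2 == k2)
  )
}

batches <- map2(keys$c1, keys$c2, build_batch)

# Apply a function to every batch; here the sum of num.
batch_sums <- map_dbl(batches, ~ sum(.x$num))
print(batch_sums)
```

Any other per-batch function can replace the sum in the last map_dbl call; use map instead when the function does not return a single number.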

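Side note: technically, deselecting c1 isn't necessary. But since I'm row-binding, and part of each data frame ignores the value of c1 (see the subsetting recipe above and its ? notation), the c1 values would be confusing, so I dropped the column.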
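
For newer package versions, here is a hedged sketch of the same batching. As far as I know, nest(.key = ) has been deprecated since tidyr 1.0.0 and do() is superseded in dplyr 1.0.0, so this version assumes tidyr >= 1.0 and uses nest() with named columns plus a join on c2. The names n_nested, y_nested, y_data and batch are illustrative and not from the original post. Here c1 stays a key column of the outer tibble and never enters the batches, so on the N side there is nothing to deselect.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Rebuild the example data from the question (stringsAsFactors = FALSE is an assumption).
set.seed(19)
n1 <- data.frame(c0 = "N", c1 = rep("A", 4), c2 = rep(c("i", "j"), 2),
                 num = rnorm(4), stringsAsFactors = FALSE)
n2 <- data.frame(c0 = "N", c1 = rep("B", 6), c2 = rep(c("i", "j"), 3),
                 num = rnorm(3), stringsAsFactors = FALSE)
y1 <- data.frame(c0 = "Y", c1 = rep("A", 2), c2 = c("i", "j"),
                 num = rnorm(2), stringsAsFactors = FALSE)
y2 <- data.frame(c0 = "Y", c1 = rep("B", 4), c2 = rep(c("i", "j"), each = 2),
                 num = rnorm(2), stringsAsFactors = FALSE)
d <- rbind(y1, y2, n1, n2)

# N rows: one nested data frame (columns c0, num) per (c1, c2) key.
n_nested <- d %>%
  filter(c0 == "N") %>%
  nest(data = c(c0, num))

# Y rows: c1 is dropped, then one nested data frame per c2 value.
y_nested <- d %>%
  filter(c0 == "Y") %>%
  select(-c1) %>%
  nest(y_data = c(c0, num))

# Attach the matching Y rows to every N group and stack them into one batch.
batches <- n_nested %>%
  inner_join(y_nested, by = "c2") %>%
  mutate(batch = map2(y_data, data, bind_rows)) %>%
  select(c1, c2, batch) %>%
  arrange(c2, c1)

# Apply a function to each batch, e.g. the sum of num.
result <- batches %>% mutate(total = map_dbl(batch, ~ sum(.x$num)))
print(result)
```

Because the Y rows are joined on c2 alone, each Y subset is reused for both the A and B batches, which is the overlap the question asks for.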