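R: From a data frame

Time: 2017-11-02 19:36:17

Tags: r set dplyr tidyverse set-intersection

I have a three-column data frame. I want to find which "sample" values are TRUE for an arbitrary set of "group" values. I used UpSetR to plot the intersections, but now I need to extract the actual values. In the first example below, I want the samples that are TRUE in group a but not in b or c. In the second, I want the samples that are TRUE in groups a and b but not in c. I need to do this across a large number of samples and groups, so that I can supply one or more groups and extract the samples that are TRUE for exactly that intersection.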


library(tidyverse)
#> Loading tidyverse: ggplot2
#> Loading tidyverse: tibble
#> Loading tidyverse: tidyr
#> Loading tidyverse: readr
#> Loading tidyverse: purrr
#> Loading tidyverse: dplyr
#> Conflicts with tidy packages ----------------------------------------------
#> filter(): dplyr, stats
#> lag():    dplyr, stats

# Here's what I have
have <- tibble::tribble(
  ~group, ~sample, ~value,
      "a",     "x",  TRUE,
      "a",     "y",  TRUE,
      "a",     "z",  TRUE,
      "b",     "x", FALSE,
      "b",     "y",  TRUE,
      "b",     "z", FALSE,
      "c",     "x", FALSE,
      "c",     "y", FALSE,
      "c",     "z",  TRUE
  )

have
#> # A tibble: 9 x 3
#>   group sample value
#>   <chr>  <chr> <lgl>
#> 1     a      x  TRUE
#> 2     a      y  TRUE
#> 3     a      z  TRUE
#> 4     b      x FALSE
#> 5     b      y  TRUE
#> 6     b      z FALSE
#> 7     c      x FALSE
#> 8     c      y FALSE
#> 9     c      z  TRUE

# Get samples where value is true only in group a
have %>%
  spread(group, value) %>%
  filter(a & !b & !c) %>%
  pull(sample) %>%
  unique()
#> [1] "x"

# Get samples where value is true in A and B but not C
have %>%
  spread(group, value) %>%
  filter(a & b & !c) %>%
  pull(sample) %>%
  unique()
#> [1] "y"

2 Answers:

Answer 0 (score: 3)

I think you should keep your data in the spread (wide) format. From there, you can left join a table of condition tuples:

DF = spread(have, group, value)

condDF = data.frame(
  id = 1:3, 
  a = TRUE, 
  b = c(FALSE, TRUE , TRUE), 
  c = c(FALSE, FALSE, TRUE) )

left_join(condDF, DF)

Joining, by = c("a", "b", "c")
  id    a     b     c sample
1  1 TRUE FALSE FALSE      x
2  2 TRUE  TRUE FALSE      y
3  3 TRUE  TRUE  TRUE   <NA>
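If typing out condDF by hand gets tedious with many groups, the condition table can also be generated. The following is only a sketch under a few assumptions: it uses base R (expand.grid and merge) instead of dplyr so that it runs on its own, the wide table is typed in by hand to mirror spread(have, group, value), and it enumerates every TRUE/FALSE combination, which grows as 2^n in the number of groups.

```r
# Wide table equivalent to spread(have, group, value), typed out by hand
DF <- data.frame(
  sample = c("x", "y", "z"),
  a = c(TRUE, TRUE, TRUE),
  b = c(FALSE, TRUE, FALSE),
  c = c(FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

groups <- setdiff(names(DF), "sample")

# One row per TRUE/FALSE combination of the groups (2^3 = 8 rows here)
condDF <- expand.grid(rep(list(c(TRUE, FALSE)), length(groups)))
names(condDF) <- groups
condDF$id <- seq_len(nrow(condDF))

# all.x = TRUE keeps conditions that match no sample, like left_join does
res <- merge(condDF, DF, by = groups, all.x = TRUE)
res <- res[order(res$id), ]
res
```

Rows whose sample is NA are combinations that no sample satisfies.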

I think keeping the result in a table is cleanest, but if you insist on having vectors of samples...

left_join(condDF, DF) %>% group_by(id) %>% summarise(samples = list(setdiff(sample, NA)))

Joining, by = c("a", "b", "c")
# A tibble: 3 x 2
     id   samples
  <int>    <list>
1     1 <chr [1]>
2     2 <chr [1]>
3     3 <chr [0]>

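(I tried nest here, but the output was too complicated.)

For the OP's special case of...

  • passing only one condition at a time
  • specifying only the groups that are TRUE (the other groups are implicitly FALSE)

we can use replace: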

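# gs: groups that must be TRUE; every other group column must be FALSE
f = function(gs, dat = DF, all_gs = setdiff(names(dat), vn), vn = "sample"){
  # one-row condition table with every group set to FALSE
  base_cond = all_gs %>% setNames(rep(FALSE, length(.)), .) %>% 
    as.list %>% as.data.frame
  # flip the requested groups to TRUE, then join against dat (not the global DF)
  replace(base_cond, gs, TRUE) %>% left_join(dat) %>% pull(!! vn)
}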

Usage:

> f("a")
Joining, by = c("a", "b", "c")
[1] "x"
> f(c("a", "b"))
Joining, by = c("a", "b", "c")
[1] "y"

Or in data.table...

library(data.table)
DT = data.table(DF)

fdt = function(gs, dat = DT, all_gs = setdiff(names(dat), vn), vn = "sample"){
  base_cond = all_gs %>% setNames(rep(FALSE, length(.)), .) %>% as.list
  dat[replace(base_cond, gs, TRUE), on=all_gs, ..vn][[1]]
}

fdt("a")
# [1] "x"
fdt(c("a","b"))
# [1] "y"
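Another way to look at the "TRUE in exactly these groups" lookup, sketched here in base R so it runs without any packages: give each sample a label made of the groups where it is TRUE, then compare labels. The helper name samples_in and the "&" separator are hypothetical choices for illustration, and the sketch assumes no group name contains "&". Samples that are never TRUE get no label and so are never returned.

```r
have <- data.frame(
  group  = rep(c("a", "b", "c"), each = 3),
  sample = rep(c("x", "y", "z"), times = 3),
  value  = c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Keep only the TRUE rows, then collapse each sample's groups into one label
hits <- have[have$value, ]
sig_by_sample <- sapply(
  split(hits$group, hits$sample),
  function(g) paste(sort(g), collapse = "&")
)

# Hypothetical helper: samples that are TRUE in exactly the groups `gs`
samples_in <- function(gs, sig = sig_by_sample) {
  names(sig)[sig == paste(sort(gs), collapse = "&")]
}

samples_in("a")
samples_in(c("a", "b"))
```

Because the labels are computed once, each lookup is a plain string comparison, which may be convenient when many intersections are queried against the same data.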

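Answer 1 (score: 1)

Here is a function using dplyr + rlang. Given the groups to include, it returns the samples matching the corresponding intersection filter; with all = TRUE it returns every valid intersection filter together with its samples. It should work for any number of unique group levels: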

library(dplyr)
library(rlang)
library(tidyr)

inter_sets = function(groups, all = FALSE){

  filter_sets = function(filter_expr){
    have %>%
      spread(group, value) %>%
      # parse_quo() is the newer rlang name for parse_quosure()
      filter(!!parse_quo(filter_expr, env = global_env())) %>%
      pull(sample) %>%
      unique()
  } 

  if(is_true(all)){  

  combins = unique(have$group) %>%
    c(paste0("!", .)) %>%
    combn(length(.)/2) %>%
    t() %>%
    as.data.frame() %>%
    filter(apply(., 1, function(x) length(unique(gsub("!", "", x))) == ncol(.) & !(length(grep("!", x)) %in% c(0, ncol(.))))) %>%
    unite("expressions", names(.), sep = " & ")

  combins$value = sapply(combins$expressions, filter_sets)

  return(combins)

  }else if(is_false(all)){

  combins = unique(have$group) %>%
    {c(.[match(groups, .)], paste0("!", .[-match(groups, .)]))} %>%
    paste(collapse = " & ")

  return(filter_sets(combins))  
  }  
}

Result:

> inter_sets("a")
[1] "x"

> inter_sets(c("a", "b"))
[1] "y"

> inter_sets(c("a", "c"))
[1] "z"

> inter_sets(all = TRUE)
  expressions value
1  a & b & !c     y
2  a & c & !b     z
3 a & !b & !c     x
4  b & c & !a      
5 b & !a & !c      
6 c & !a & !b      

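Note:

The idea behind the "all combinations" method is to enumerate every combination of group terms and drop the unnecessary ones, such as a & b & c and a & b & !a. The expressions are built with paste, and the filter is applied to each one by first parsing it into a quosure (parse_quosure, called parse_quo in newer rlang releases) and returning the matching samples as a vector.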
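To make the enumeration step concrete, here is a rough base R sketch of how the candidate expressions can be generated and pruned. It is not the answer's exact code: the variable names are made up for illustration, and it stops at building the expression strings without filtering any data.

```r
groups <- c("a", "b", "c")
terms  <- c(groups, paste0("!", groups))

# Every way to pick length(groups) terms out of the 2 * length(groups) candidates
cand <- t(combn(terms, length(groups)))
nrow(cand)  # choose(6, 3) candidate rows

keep <- apply(cand, 1, function(x) {
  bare  <- sub("!", "", x, fixed = TRUE)
  n_neg <- sum(startsWith(x, "!"))
  # each group appears exactly once, and the row is neither
  # all-positive (a & b & c) nor all-negated (!a & !b & !c)
  length(unique(bare)) == length(groups) && n_neg > 0 && n_neg < length(groups)
})

exprs <- apply(cand[keep, , drop = FALSE], 1, paste, collapse = " & ")
exprs
```

Rows such as a & b & !a fail the "each group exactly once" check, which is why they never reach the filter.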