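I have a three-column data frame. I want to find which samples have a value of TRUE for an arbitrary set of values in group. I used UpSetR to plot the intersections, but now I need to extract the actual values. In the first example below, I want the samples that are TRUE in group a but not in b or c. In the second, I want the samples that are TRUE in groups a and b but not in c. I need to do this across a large number of samples and groups, so that I can supply one or more groups and get back the samples that are TRUE for exactly that intersection.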
library(tidyverse)
#> Loading tidyverse: ggplot2
#> Loading tidyverse: tibble
#> Loading tidyverse: tidyr
#> Loading tidyverse: readr
#> Loading tidyverse: purrr
#> Loading tidyverse: dplyr
#> Conflicts with tidy packages ----------------------------------------------
#> filter(): dplyr, stats
#> lag(): dplyr, stats
# Here's what I have
have <- tibble::tribble(
  ~group, ~sample, ~value,
  "a", "x", TRUE,
  "a", "y", TRUE,
  "a", "z", TRUE,
  "b", "x", FALSE,
  "b", "y", TRUE,
  "b", "z", FALSE,
  "c", "x", FALSE,
  "c", "y", FALSE,
  "c", "z", TRUE
)
have
#> # A tibble: 9 x 3
#> group sample value
#> <chr> <chr> <lgl>
#> 1 a x TRUE
#> 2 a y TRUE
#> 3 a z TRUE
#> 4 b x FALSE
#> 5 b y TRUE
#> 6 b z FALSE
#> 7 c x FALSE
#> 8 c y FALSE
#> 9 c z TRUE
# Get samples where value is true only in group a
have %>%
  spread(group, value) %>%
  filter(a & !b & !c) %>%
  pull(sample) %>%
  unique()
#> [1] "x"
# Get samples where value is true in A and B but not C
have %>%
  spread(group, value) %>%
  filter(a & b & !c) %>%
  pull(sample) %>%
  unique()
#> [1] "y"
Answer 0 (score: 3)
I would keep the data in the spread (wide) format. From there, you can left join a table of condition tuples against it:
DF = spread(have, group, value)
condDF = data.frame(
  id = 1:3,
  a = TRUE,
  b = c(FALSE, TRUE , TRUE),
  c = c(FALSE, FALSE, TRUE) )
left_join(condDF, DF)
Joining, by = c("a", "b", "c")
id a b c sample
1 1 TRUE FALSE FALSE x
2 2 TRUE TRUE FALSE y
3 3 TRUE TRUE TRUE <NA>
I think keeping the result in a table is cleanest, but if you insist on vectors of samples...
left_join(condDF, DF) %>% group_by(id) %>% summarise(samples = list(setdiff(sample, NA)))
Joining, by = c("a", "b", "c")
# A tibble: 3 x 2
id samples
<int> <list>
1 1 <chr [1]>
2 2 <chr [1]>
3 3 <chr [0]>
(I tried nest here, but the output was overly complicated.)
For the OP's particular case...
We can use replace:
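f = function(gs, dat = DF, all_gs = setdiff(names(dat), vn), vn = "sample"){
  # Start from a one-row condition table that is FALSE for every group
  base_cond = all_gs %>% setNames(rep(FALSE, length(.)), .) %>%
    as.list %>% as.data.frame
  # Flip the requested groups to TRUE, then join against dat (not the global DF)
  replace(base_cond, gs, TRUE) %>% left_join(dat) %>% pull(!! vn)
}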
Usage
> f("a")
Joining, by = c("a", "b", "c")
[1] "x"
> f(c("a", "b"))
Joining, by = c("a", "b", "c")
[1] "y"
Or with data.table...
library(data.table)
DT = data.table(DF)
fdt = function(gs, dat = DT, all_gs = setdiff(names(dat), vn), vn = "sample"){
  base_cond = all_gs %>% setNames(rep(FALSE, length(.)), .) %>% as.list
  dat[replace(base_cond, gs, TRUE), on=all_gs, ..vn][[1]]
}
fdt("a")
# [1] "x"
fdt(c("a","b"))
# [1] "y"
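For readers who have neither dplyr nor data.table available, the same exact-match lookup can be sketched in base R. This is only an illustration under stated assumptions: `samples_in_only` is a hypothetical helper name, the toy data is copied from the question, and each sample/group pair is assumed to appear at most a handful of times (duplicates are collapsed with `any`).

```r
# Toy data copied from the question.
have <- data.frame(
  group  = rep(c("a", "b", "c"), each = 3),
  sample = rep(c("x", "y", "z"), times = 3),
  value  = c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# samples_in_only() is a hypothetical helper, not part of any package.
samples_in_only <- function(gs, dat = have) {
  # Wide logical matrix: one row per sample, one column per group.
  wide <- tapply(dat$value, list(dat$sample, dat$group), any)
  # Target pattern: TRUE for the requested groups, FALSE for all others.
  target <- colnames(wide) %in% gs
  # Keep samples whose row matches the target exactly;
  # isTRUE() treats missing sample/group pairs (NA) as a non-match.
  hit <- apply(wide, 1, function(row) isTRUE(all(row == target)))
  rownames(wide)[hit]
}

samples_in_only("a")           # "x"
samples_in_only(c("a", "b"))   # "y"
```

As with f and fdt above, a group that is not listed in `gs` is required to be FALSE, so the match is on the exact intersection rather than on "at least these groups".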
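Answer 1 (score: 1)
Here is a function using dplyr + rlang that returns the samples for the requested intersection when the groups to include are supplied, or the samples for every intersection combination when all = TRUE. It should work for any number of unique group levels: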
library(dplyr)
library(rlang)
library(tidyr)
inter_sets = function(groups, all = FALSE){
  filter_sets = function(filter_expr){
    have %>%
      spread(group, value) %>%
      filter(!!parse_quosure(filter_expr)) %>%
      pull(sample) %>%
      unique()
  }
  if(is_true(all)){
    combins = unique(have$group) %>%
      c(paste0("!", .)) %>%
      combn(length(.)/2) %>%
      t() %>%
      as.data.frame() %>%
      filter(apply(., 1, function(x) length(unique(gsub("!", "", x))) == ncol(.) & !(length(grep("!", x)) %in% c(0, ncol(.))))) %>%
      unite("expressions", names(.), sep = " & ")
    combins$value = sapply(combins$expressions, filter_sets)
    return(combins)
  }else if(is_false(all)){
    combins = unique(have$group) %>%
      {c(.[match(groups, .)], paste0("!", .[-match(groups, .)]))} %>%
      paste(collapse = " & ")
    return(filter_sets(combins))
  }
}
Result:
> inter_sets("a")
[1] "x"
> inter_sets(c("a", "b"))
[1] "y"
> inter_sets(c("a", "c"))
[1] "z"
> inter_sets(all = TRUE)
expressions value
1 a & b & !c y
2 a & c & !b z
3 a & !b & !c x
4 b & c & !a
5 b & !a & !c
6 c & !a & !b
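Note:
The idea behind the "all combinations" method is to generate every combination of group intersections and drop the unnecessary ones, such as a & b & c or a & b & !a. The expressions are built with paste, and the filter is applied to each of them by first parsing it into a quosure with parse_quosure and then returning the matching samples as a vector.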
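A possible alternative for the "all combinations" case, shown here only as a sketch and not as part of the answer above, avoids building and parsing filter expressions. It computes for each sample a signature naming exactly the groups in which it is TRUE, then groups samples by that signature. The variable names (`true_rows`, `signature`, `by_signature`) are illustrative, and the toy data is copied from the question.

```r
# Toy data copied from the question.
have <- data.frame(
  group  = rep(c("a", "b", "c"), each = 3),
  sample = rep(c("x", "y", "z"), times = 3),
  value  = c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Keep only the TRUE rows; samples that are never TRUE drop out here.
true_rows <- have[have$value, ]

# For each sample, a signature naming exactly the groups where it is TRUE.
signature <- vapply(
  split(true_rows$group, true_rows$sample),
  function(g) paste(sort(unique(g)), collapse = " & "),
  character(1)
)

# Invert the mapping: signature -> samples that carry it.
by_signature <- split(names(signature), signature)

by_signature[["a"]]      # "x"
by_signature[["a & b"]]  # "y"
```

Unlike inter_sets(all = TRUE), this sketch only lists intersections that contain at least one sample, so empty combinations such as b & c & !a do not appear. The work grows with the number of TRUE rows rather than with the number of possible group combinations.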