Question

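I want to extract items from a column of a data frame based on conditions on the values in other columns. These conditions are given as a list that associates column names with values. The end goal is to use these items to select columns by name in another data structure.

Here is an example data frame: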

> experimental_plan
  lib genotype treatment replicate
1   A       WT    normal         1
2   B       WT       hot         1
3   C      mut    normal         1
4   D      mut       hot         1
5   E       WT    normal         2
6   F       WT       hot         2
7   G      mut    normal         2
8   H      mut       hot         2

My selection criteria are encoded as follows:

> ref_condition = list(genotype="WT", treatment="normal")

I want to extract the items in the "lib" column for the rows that match ref_condition, i.e. "A" and "E".

1) I can use names on my list of selection criteria to get the columns used for the selection:

> experimental_plan[, names(ref_condition)]
  genotype treatment
1       WT    normal
2       WT       hot
3      mut    normal
4      mut       hot
5       WT    normal
6       WT       hot
7      mut    normal
8      mut       hot

2) I can test whether the resulting rows match my selection criteria:

> experimental_plan[, names(ref_condition)] == ref_condition
     genotype treatment
[1,]     TRUE      TRUE
[2,]     TRUE     FALSE
[3,]    FALSE      TRUE
[4,]    FALSE     FALSE
[5,]     TRUE      TRUE
[6,]     TRUE     FALSE
[7,]    FALSE      TRUE
[8,]    FALSE     FALSE
> selection_vector <- apply(experimental_plan[, names(ref_condition)] == ref_condition, 1, all)
> selection_vector
[1]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE

(I think this step, with apply, is not particularly elegant. There must be a better way.)
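One possible way to avoid apply here, shown only as a base R sketch: build one logical vector per criterion with Map, then combine them with Reduce and `&`. The data frame is rebuilt from the example above so that the snippet runs on its own; the intermediate name `tests` is made up for illustration.

```r
# Rebuild the example data frame from the question.
experimental_plan <- data.frame(
  lib = LETTERS[1:8],
  genotype = rep(c("WT", "WT", "mut", "mut"), 2),
  treatment = rep(c("normal", "hot"), 4),
  replicate = rep(1:2, each = 4)
)
ref_condition <- list(genotype = "WT", treatment = "normal")

# One logical vector per criterion: compare each named column with its value.
# `tests` is a hypothetical name, not part of the original code.
tests <- Map(function(column, value) experimental_plan[[column]] == value,
             names(ref_condition), ref_condition)

# Combine the per-criterion vectors with an element-wise AND.
selection_vector <- Reduce(`&`, tests)
selection_vector
```

This keeps the comparison column by column, so it does not depend on how a data frame is compared against a list.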

3) This boolean vector can be used to select the relevant rows:

> selected_lines <- experimental_plan[selection_vector,]
> selected_lines
  lib genotype treatment replicate
1   A       WT    normal         1
5   E       WT    normal         2

4) From this point on, I know how to use dplyr to select the items I am interested in:

> lib1 <- filter(selected_lines, replicate=="1") %>% select(lib) %>% unlist()
> lib2 <- filter(selected_lines, replicate=="2") %>% select(lib) %>% unlist()
> lib1
lib 
  A 
Levels: A B C D E F G H
> lib2
lib 
  E 
Levels: A B C D E F G H

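Could dplyr (or some other clever trick) be used in the earlier steps?

5) These items happen to correspond to column names in another data structure (named counts_data here). I use them to extract the corresponding columns and put them in a list, with the replicate numbers as names: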

> counts_1 <- counts_data[, lib1]
> counts_2 <- counts_data[, lib2]
> list_of_counts <- list("1" = counts_1, "2" = counts_2)

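(Ideally, I would like to generalize the code so that I do not need to know (I mean, "hard-code") which different values exist in the "replicate" column: for a given combination of "genotype" and "treatment" characteristics, there can be any number of replicates, and I want my final list to contain the data from {{1} associated with the corresponding "lib" items.)

Is there a way to do the whole process more elegantly / more efficiently?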

Answer 1

I think you can use data.table with keys:

library(data.table)
test <- data.table(lib = LETTERS[1:8],
           genotype = rep(c("WT","WT","mut","mut"),2),
           treatment = rep(c("normal","hot"),4),
           replicate = c(rep(1,4),rep(2,4)))
setkeyv(test,c("genotype","treatment"))
ref_condition = list(genotype="WT", treatment="normal")
test[ref_condition,lib]

This gives

[1] "A" "E"
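To go one step further toward the list of counts asked for in the question, without hard-coding the replicate values, something along these lines might work. It is only a sketch: `counts_data` is replaced by a made-up matrix with one column per library, the join uses `on =` instead of the key, and the names `selected` and `libs_by_replicate` are introduced here for illustration.

```r
library(data.table)

test <- data.table(lib = LETTERS[1:8],
                   genotype = rep(c("WT", "WT", "mut", "mut"), 2),
                   treatment = rep(c("normal", "hot"), 4),
                   replicate = rep(1:2, each = 4))
ref_condition <- list(genotype = "WT", treatment = "normal")

# Hypothetical stand-in for counts_data: 3 rows, one column per library.
counts_data <- matrix(1:24, nrow = 3, dimnames = list(NULL, LETTERS[1:8]))

# Rows matching the condition, joined on the columns named in ref_condition.
selected <- test[as.data.table(ref_condition), on = names(ref_condition)]

# Group the library names by replicate; the list names are the replicate values.
libs_by_replicate <- split(selected$lib, selected$replicate)

# For each replicate, keep the matching columns of counts_data.
list_of_counts <- lapply(libs_by_replicate,
                         function(libs) counts_data[, libs, drop = FALSE])
names(list_of_counts)
```

Using `drop = FALSE` keeps each element a matrix even when a replicate has a single library.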

Of course, you can use lapply to loop over a list of test conditions.
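A minimal sketch of that loop, assuming the same keyed table as above. The list `conditions` and its element names are hypothetical; each condition lists its values in the same order as the key columns.

```r
library(data.table)

test <- data.table(lib = LETTERS[1:8],
                   genotype = rep(c("WT", "WT", "mut", "mut"), 2),
                   treatment = rep(c("normal", "hot"), 4),
                   replicate = c(rep(1, 4), rep(2, 4)))
setkeyv(test, c("genotype", "treatment"))

# Hypothetical list of conditions; values follow the key order
# (genotype first, then treatment).
conditions <- list(
  reference  = list(genotype = "WT",  treatment = "normal"),
  mutant_hot = list(genotype = "mut", treatment = "hot")
)

# For each condition, join on the key and return the matching lib values.
libs_per_condition <- lapply(conditions, function(cond) test[cond, lib])
libs_per_condition
```

The result is a named list with one character vector of library names per condition.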
