I have a data frame:
  sample    gene
1     A1    Rim2
2     A1 CG18208
3     A1     Scr
4     A1     Scr   # gene 'Scr' occurs twice in the same sample
5     A2  CG6959
6     A2  CG6959   # gene 'CG6959' occurs twice in the same sample
n<-structure(list(sample = structure(c(1L, 1L, 1L, 1L, 2L, 2L), .Label = c("A1",
"A2"), class = "factor"), gene = structure(c(4L, 1L, 3L,
3L, 2L, 2L), .Label = c("CG18208", "CG6959", "Scr", "Rim2"), class = "factor")), .Names = c("sample",
"gene"), row.names = c(NA, 6L), class = "data.frame")
For each gene, I want to count the number of samples it occurs in.
I am currently using table to count the number of times each gene occurs:
hit_genes<-table(n$gene)
CG18208 CG6959 Scr Rim2
1 2 2 1
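But this gives me the total count of each gene, whereas I am trying to get the count across samples, so that a gene repeated within one sample is counted only once. For this toy example, the result I am trying to achieve is: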
CG18208 CG6959 Scr Rim2
1 1 1 1
I have been trying to use a combination of table and unique:
table(n$gene[unique(n$sample),])
But I can't get it to work. Can anyone suggest a way to achieve this?
Answer 0 (score: 2)
You could try:
table(n[!duplicated(n),]$gene)
#CG18208 CG6959 Scr Rim2
# 1 1 1 1
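To spell out why this works: duplicated(n) flags every row that repeats an earlier (sample, gene) pair, so dropping those rows leaves one row per gene per sample, and table then counts samples rather than raw occurrences. A minimal sketch of the same idea using unique(), which for a data frame keeps exactly the rows where duplicated() is FALSE. The data frame is rebuilt from the question; the names dedup and per_sample are only illustrative:

```r
# Rebuild the toy data from the question, keeping the same factor level order
n <- data.frame(
  sample = factor(c("A1", "A1", "A1", "A1", "A2", "A2")),
  gene   = factor(c("Rim2", "CG18208", "Scr", "Scr", "CG6959", "CG6959"),
                  levels = c("CG18208", "CG6959", "Scr", "Rim2"))
)

# Keep one row per (sample, gene) pair; same rows as n[!duplicated(n), ]
dedup <- unique(n)

# Each remaining row is one sample in which the gene occurs
per_sample <- table(dedup$gene)
per_sample
```

As a side note, the attempt in the question fails because n$gene is a vector, and a vector cannot be indexed with two dimensions ([rows, ]).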
Answer 1 (score: 0)
You could try this:
library(dplyr)
library(tidyr)
n <- structure(list(sample = structure(c(1L, 1L, 1L, 1L, 2L, 2L), .Label = c("A1", "A2"), class = "factor"), gene = structure(c(4L, 1L, 3L, 3L, 2L, 2L), .Label = c("CG18208", "CG6959", "Scr", "Rim2"), class = "factor")), .Names = c("sample", "gene"), row.names = c(NA, 6L), class = "data.frame")
# make CG6959 appear also in A1 for the sake of illustration
n$sample[5] <- "A1"
n %>%
  group_by(sample, gene) %>%
  summarize(gene2 = n()) %>%
  spread(sample, gene2) %>%
  mutate(Across = ifelse(is.na(A1) | is.na(A2), 0, 1)) %>%
  filter(Across > 0)
Output:
# A tibble: 1 x 4
gene A1 A2 Across
<fctr> <int> <int> <dbl>
1 CG6959 1 1 1
So, if you have many genes, this code lets you quickly filter down to the genes that appear in both samples.
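The is.na(A1) | is.na(A2) check above names each sample column explicitly, so it would need one term per sample as the number of samples grows. A possible base R sketch that does not depend on the sample names, assuming the same modified toy data (CG6959 moved into A1 as well); presence, n_samples and in_all are hypothetical names chosen for illustration:

```r
# Toy data after the modification above: CG6959 now occurs in both A1 and A2
n <- data.frame(
  sample = c("A1", "A1", "A1", "A1", "A1", "A2"),
  gene   = c("Rim2", "CG18208", "Scr", "Scr", "CG6959", "CG6959")
)

# Gene-by-sample presence matrix: TRUE if the gene occurs at least once
presence <- table(n$gene, n$sample) > 0

# Number of samples each gene occurs in (repeats within a sample count once)
n_samples <- rowSums(presence)

# Genes that occur in every sample
in_all <- names(n_samples)[n_samples == ncol(presence)]
in_all
```

Here n_samples also answers the original question directly, since it gives the per-gene sample count without any de-duplication step.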