Count unique occurrences in column B

Asked: 2017-10-29 11:27:06

Tags: r

I have a data frame:

   sample    gene
1 A1     Rim2
2 A1     CG18208
3 A1     Scr 
4 A1     Scr    # gene 'Scr' occurs twice in same sample 
5 A2     CG6959
6 A2     CG6959 # gene 'CG6959' occurs twice in same sample

n <- structure(list(sample = structure(c(1L, 1L, 1L, 1L, 2L, 2L), .Label = c("A1", 
"A2"), class = "factor"), gene = structure(c(4L, 1L, 3L, 
3L, 2L, 2L), .Label = c("CG18208", "CG6959", "Scr", "Rim2"), class = "factor")), .Names = c("sample", 
"gene"), row.names = c(NA, 6L), class = "data.frame")

I want to count, for each gene, the number of samples it occurs in.

I am currently using table to count the number of times each gene occurs:

hit_genes<-table(n$gene)

CG18208  CG6959       Scr    Rim2 
      1       2       2       1

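But this gives me the total count for each gene, whereas I want the number of distinct samples each gene occurs in. For this toy example, the result I am trying to achieve is: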

CG18208  CG6959       Scr    Rim2 
      1       1       1       1

I have been trying to use a combination of table and unique:

table(n$gene[unique(n$sample),])

But I can't get it to work. Can anyone suggest a way to achieve this?

2 Answers:

Answer 0 (score: 2)

You can try dropping the duplicated rows before tabulating:

table(n[!duplicated(n),]$gene)

#CG18208  CG6959     Scr    Rim2 
#      1       1       1       1 
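
Note that duplicated(n) compares whole rows, so it only works as intended while the data frame holds nothing but sample and gene. Below is a minimal sketch of the same idea restricted to those two columns, with an equivalent route through a sample-by-gene contingency table. The `pos` column and the names `pairs`, `samples_per_gene` and `samples_per_gene_alt` are assumptions made for illustration and do not come from the question.

```r
# Toy data from the question (plain character columns are enough for this sketch)
n <- data.frame(
  sample = c("A1", "A1", "A1", "A1", "A2", "A2"),
  gene   = c("Rim2", "CG18208", "Scr", "Scr", "CG6959", "CG6959"),
  stringsAsFactors = FALSE
)

# Hypothetical extra column, added only to show why restricting columns matters:
# with it, every row is distinct and duplicated(n) no longer flags anything
n$pos <- 1:6

# Keep one row per sample/gene pair, then count rows per gene
pairs <- unique(n[, c("sample", "gene")])
samples_per_gene <- table(pairs$gene)

# Equivalent route: cross-tabulate sample by gene, then count the non-zero cells per gene
samples_per_gene_alt <- colSums(table(n$sample, n$gene) > 0)

print(samples_per_gene)
print(samples_per_gene_alt)
```

Both objects give the number of distinct samples per gene, so a gene that appears in several samples gets a count above 1 here.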

Answer 1 (score: 0)

You can try this:

library(dplyr)
library(tidyr)

n <- structure(list(sample = structure(c(1L, 1L, 1L, 1L, 2L, 2L), .Label = c("A1", "A2"), class = "factor"), gene = structure(c(4L, 1L, 3L, 3L, 2L, 2L), .Label = c("CG18208", "CG6959", "Scr", "Rim2"), class = "factor")), .Names = c("sample", "gene"), row.names = c(NA, 6L), class = "data.frame")

# make CG6959 appear also in A1 for the sake of illustration
n$sample[5] <- "A1"

n %>% 
  group_by(sample, gene) %>%
  summarize(gene2 = n()) %>%
  spread(sample, gene2) %>%
  mutate(Across = ifelse(is.na(A1) | is.na(A2), 0, 1)) %>%
  filter(Across > 0)

Output:

# A tibble: 1 x 4
    gene    A1    A2 Across
  <fctr> <int> <int>  <dbl>
1 CG6959     1     1      1

So, if you have many genes, this code lets you quickly filter down to the genes that occur in both samples.
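
The pipeline above names the columns A1 and A2 explicitly, so it is tied to exactly two samples. As a rough sketch of how the same filter might be written for any number of samples, one option is to count distinct samples per gene and compare that with the total number of samples. The names `total_samples`, `n_samples`, `samples_per_gene` and `in_all_samples` are illustrative assumptions, and the `name` argument of count() assumes a reasonably recent dplyr.

```r
library(dplyr)

# Toy data with row 5 already moved to A1, as in the answer above
n <- data.frame(
  sample = c("A1", "A1", "A1", "A1", "A1", "A2"),
  gene   = c("Rim2", "CG18208", "Scr", "Scr", "CG6959", "CG6959"),
  stringsAsFactors = FALSE
)

# Total number of distinct samples in the data
total_samples <- n_distinct(n$sample)

samples_per_gene <- n %>%
  distinct(sample, gene) %>%        # one row per sample/gene pair
  count(gene, name = "n_samples")   # number of samples each gene occurs in

# Genes that occur in every sample
in_all_samples <- filter(samples_per_gene, n_samples == total_samples)

print(samples_per_gene)
print(in_all_samples)
```

The intermediate `samples_per_gene` table is also the per-gene sample count asked for in the question, so the two goals share one computation.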