Find consensus in two out of three columns and summarize them in R

Time: 2018-03-29 01:22:59

Tags: r dplyr reshape tidyr

I have the following data frame:

    genus_sub <- structure(list(GutREF001.1_MDA_1 = c(0, 1, 0, 0, 0, 0, 0, 0, 
0, 0), GutREF001.1_MDA_2 = c(0, 1, 0, 0, 0, 0, 0, 0, 0, 0), GutREF001.1_MDA_3 = c(0, 
1, 0, 0, 0, 0, 0, 0, 0, 0), GutREF001.2_MDA_1 = c(0, 1, 0, 0, 
0, 0, 0, 0, 0, 0), GutREF001.2_MDA_2 = c(0, 1, 0, 0, 0, 0, 0, 
0, 0, 0), GutREF001.2_MDA_3 = c(0, 1, 0, 0, 0, 0, 0, 0, 0, 0), 
    ID = c("Enterococcaceae (B; Firm)", "Oscillospiraceae (B; Firm)", 
    "Enterobacteriaceae (B; Prot)", "Helicobacteraceae (B; Prot)", 
    "Peptoniphilaceae (B; Firm)", "Flavobacteriaceae (B; Bact)", 
    "Methanobacteriaceae (A; Eury)", "Coriobacteriaceae (B; Acti)", 
    "Micrococcaceae (B; Acti)", "Lactobacillaceae (B; Firm)")), .Names = c("GutREF001.1_MDA_1", 
"GutREF001.1_MDA_2", "GutREF001.1_MDA_3", "GutREF001.2_MDA_1", 
"GutREF001.2_MDA_2", "GutREF001.2_MDA_3", "ID"), row.names = c("Enterococcaceae (B; Firm)", 
"Oscillospiraceae (B; Firm)", "Enterobacteriaceae (B; Prot)", 
"Helicobacteraceae (B; Prot)", "Peptoniphilaceae (B; Firm)", 
"Flavobacteriaceae (B; Bact)", "Methanobacteriaceae (A; Eury)", 
"Coriobacteriaceae (B; Acti)", "Micrococcaceae (B; Acti)", "Lactobacillaceae (B; Firm)"
), class = "data.frame")

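Columns that share a name and differ only in the suffix MDA_1, MDA_2 or MDA_3 are triplicates (technical replicates) of the same sample. The analysis has to be done on three such replicate columns at a time.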

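I would like to calculate:

i. Consensus - for each row, determine the IDs that are present (value == 1) in at least 50% of the samples, which in this case means at least two of the three

ii. Sample_consensus_detected - from the consensus set determined above, the number of IDs that are present in each individual sample of the triplicate

iii. Sample_consensus_not_detected - from the consensus set determined above, the number of IDs that are not present in each individual sample of the triplicate

iv. Replicate_not_in_consensus - IDs that are present in an individual sample but are not in the consensus

v. summary_metric_1 = (ii / (ii + iii))

vi. summary_metric_2 = (iv / (ii + iv))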
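The definitions in the list above can be sketched in base R on a small toy matrix for a single sample. The matrix `reps`, its values, and the taxon names are made up for illustration, and reading each count as "per replicate" is an assumption about what the question intends.

```r
# Toy presence/absence matrix for ONE sample with three technical replicates.
# Rows are IDs, columns are replicates; all values are hypothetical.
reps <- matrix(
  c(1, 1, 0,   # taxonA: present in 2 of 3 replicates
    1, 1, 1,   # taxonB: present in 3 of 3 replicates
    0, 0, 1,   # taxonC: present in 1 of 3 replicates
    0, 0, 0),  # taxonD: absent in all replicates
  ncol = 3, byrow = TRUE,
  dimnames = list(c("taxonA", "taxonB", "taxonC", "taxonD"),
                  c("MDA_1", "MDA_2", "MDA_3"))
)

# Consensus: ID present in at least two of the three replicates
consensus <- rowSums(reps == 1) >= 2

# Sample_consensus_detected: consensus IDs found in each replicate
detected <- colSums(reps[consensus, , drop = FALSE] == 1)

# Sample_consensus_not_detected: consensus IDs missing from each replicate
not_detected <- colSums(reps[consensus, , drop = FALSE] == 0)

# Replicate_not_in_consensus: IDs present in a replicate but outside the consensus
not_in_consensus <- colSums(reps[!consensus, , drop = FALSE] == 1)

# The two summary metrics, computed per replicate
summary_metric_1 <- detected / (detected + not_detected)
summary_metric_2 <- not_in_consensus / (detected + not_in_consensus)
```

Here `drop = FALSE` keeps the subset a matrix even when only one row is selected, so `colSums` still works.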

I wrote the following code to start summarizing the triplicate groups:

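library(dplyr)
library(tidyr)

# Use the ID column of genus_sub itself as row names (genus_table is not defined here)
row.names(genus_sub) <- genus_sub$ID
genus_sub$ID <- NULL

genus_sub %>% 
  gather(key, value) %>% 
  # [[:alnum:]] does not match ".", so allow it explicitly to keep the full sample name
  extract(key, c("sample_id", "rep"), "([[:alnum:].]+)_MDA_([[:alnum:]]+)") %>% 
  group_by(sample_id) %>% 
  summarize(sample_sum = sum(value))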

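But I could not figure out a way to calculate the consensus, i.e. the number of rows in which the ID is present (== 1) in two of the three columns. Any help is appreciated. Expected output is as follows: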

1 answer:

Answer 0: (score: 0)

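You can compute the consensus by melting the data like this (note that this requires the data as it was before the ID column was removed):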

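library(reshape2)  # melt(), dcast()
library(dplyr)

melted <- melt(genus_sub, id = "ID")
# Strip the trailing "_1", "_2", "_3" so the replicates share one sample name
melted$variable <- substr(melted$variable, 1, nchar(as.character(melted$variable)) - 2)
melted %>%
  group_by(ID, variable) %>%
  summarize(value = sum(value)) %>%
  dcast(ID ~ variable, sum, value.var = "value")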

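The substr call strips the replicate counter from the column names (which are now the values of variable in the melted table), so that you can group by variable. If you have more than 9 replicates per sample, the counter has more than one digit and you can replace substr with a more elaborate gsub.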
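As a minimal sketch of that alternative, a regular expression can drop the trailing counter regardless of how many digits it has. The column names in `cols` are hypothetical, and `sub` is used instead of `gsub` because only the single match at the end of the string needs replacing.

```r
# Hypothetical column names, including a two-digit replicate counter
cols <- c("GutREF001.1_MDA_1", "GutREF001.1_MDA_12", "GutREF001.2_MDA_3")

# Remove the trailing "_<digits>" whatever its length
sample_names <- sub("_[0-9]+$", "", cols)
sample_names
```

In the answer's code this would take the place of the substr line, applied to `as.character(melted$variable)`.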

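For each ID, the output gives the number of replicates with value == 1 in each sample column (so, to get a binary consensus, you would convert 2 or 3 to 1 and anything else to 0).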

                              ID GutREF001.1_MDA GutREF001.2_MDA
1    Coriobacteriaceae (B; Acti)               0               0
2   Enterobacteriaceae (B; Prot)               0               0
3      Enterococcaceae (B; Firm)               0               0
4    Flavobacteriaceae (B; Bact)               0               0
5    Helicobacteraceae (B; Prot)               0               0
6     Lactobacillaceae (B; Firm)               0               0
7  Methanobacteriaceae (A; Eury)               0               0
8       Micrococcaceae (B; Acti)               0               0
9     Oscillospiraceae (B; Firm)               3               3
10    Peptoniphilaceae (B; Firm)               0               0
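Turning these counts into a binary consensus could look like the sketch below. The data frame `counts` reproduces only two rows of the table above for brevity, and the threshold of 2 assumes exactly three replicates per sample.

```r
# Replicate counts per ID and sample, shaped like the output above
# (two rows only; values copied from that table)
counts <- data.frame(
  ID = c("Enterococcaceae (B; Firm)", "Oscillospiraceae (B; Firm)"),
  GutREF001.1_MDA = c(0, 3),
  GutREF001.2_MDA = c(0, 3)
)

# 1 if the ID is present in at least two replicates, otherwise 0
consensus <- counts
consensus[-1] <- lapply(counts[-1], function(x) as.integer(x >= 2))
consensus
```

Taking `colSums(consensus[-1])` afterwards would give the size of the consensus set for each sample.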