Calculating values shared between groups

Time: 2015-03-04 00:34:24

Tags: r count aggregate shared

Here is some dummy data:

class<-c("ab","ab","ad","ab","ab","ad","ab","ab","ad","ab","ad","ab","av")
otu<-c("ab","ac","ad","ab","ac","ad","ab","ac","ad","ab","ad","ac","av")
value<-c(0,1,12,13,300,1,2,3,4,0,0,2,4)
type<-c("b","c","d","a","b","c","d","d","d","c","b","a","a")
location<-c("b","c","d","a","b","d","d","d","d","c","b","a","a")
datafr1<-data.frame(class,otu,value,type,location)

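If any replicate within a 'location' and 'type' group has a value of 0, I want to drop that OTU, because I am only interested in OTUs that are shared between all replicates within a group.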
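A minimal sketch of that filter in base R, under one possible reading of the rule: an OTU is dropped from a location/type group as soon as one of its rows in that group has a value of 0. The names `has_zero` and `shared` are illustrative and do not appear in the original post.

```r
# Dummy data from the question
datafr1 <- data.frame(
  class    = c("ab","ab","ad","ab","ab","ad","ab","ab","ad","ab","ad","ab","av"),
  otu      = c("ab","ac","ad","ab","ac","ad","ab","ac","ad","ab","ad","ac","av"),
  value    = c(0,1,12,13,300,1,2,3,4,0,0,2,4),
  type     = c("b","c","d","a","b","c","d","d","d","c","b","a","a"),
  location = c("b","c","d","a","b","d","d","d","d","c","b","a","a")
)

# has_zero (illustrative name): TRUE for every row whose OTU has at least one
# zero-valued row within the same location/type group
has_zero <- ave(datafr1$value == 0,
                datafr1$location, datafr1$type, datafr1$otu,
                FUN = any)

# Keep only the OTUs that are non-zero in every row of their group
shared <- datafr1[!has_zero, ]
```

On this small data set the result is the same as simply dropping the zero rows, because no OTU has both a zero and a non-zero row within one group; on real data the two approaches can differ.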

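I want to calculate two things. One: the percentage abundance of 'value' for all OTUs shared within each 'location' and 'type' group (abundance). Two: the number of shared OTUs in each class (otu.freq).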

Note that I want the OTUs grouped by 'class', not by OTU name (since the name carries no meaning).

Expected output:

   class location type  abundance  otu.freq
    ab        a    a      79        2
    av        a    a      21        1
    ab        b    b     100        1
    ab        c    c     100        1
    ad        d    c     100        1
    ab        d    d      24        2         
    ad        d    d      76        2
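
To make the abundance column concrete, here is a small worked check for the location "a" / type "a" group, assuming abundance means each class's share of the group's non-zero total, rounded to a whole percent. The names `grp`, `class_totals` and `pct` are illustrative.

```r
# Dummy data from the question
datafr1 <- data.frame(
  class    = c("ab","ab","ad","ab","ab","ad","ab","ab","ad","ab","ad","ab","av"),
  otu      = c("ab","ac","ad","ab","ac","ad","ab","ac","ad","ab","ad","ac","av"),
  value    = c(0,1,12,13,300,1,2,3,4,0,0,2,4),
  type     = c("b","c","d","a","b","c","d","d","d","c","b","a","a"),
  location = c("b","c","d","a","b","d","d","d","d","c","b","a","a")
)

# Non-zero rows of the location "a" / type "a" group: values 13, 2 (class ab) and 4 (class av)
grp <- subset(datafr1, location == "a" & type == "a" & value != 0)

# Sum of value per class within the group: ab = 15, av = 4
class_totals <- sapply(split(grp$value, as.character(grp$class)), sum)

# Share of the group total (19), as a rounded percentage: ab = 79, av = 21
pct <- round(100 * class_totals / sum(grp$value))
pct
```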

I have a much larger data frame and tried using dplyr, but I ran out of RAM, so I don't know whether it works.

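The solution provided by @Akron below does not count the cases where the abundance is 0, but it does not remove that OTU from the other replicates in the group. If any OTU has an abundance of 0, it is not shared across that group, and I need to exclude it completely from the abundance and otu.freq calculations.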

library(dplyr)    
so_many_shared3<-datafr1 %>% 
      group_by(class, location, type) %>% 
      summarise(abundance=sum(value)/sum(datafr1[['value']])*100, otu.freq=sum(value !=0))


   class location type  abundance  otu.freq
1    ab        a    a  4.3859649     2
2    ab        b    b 87.7192982     1
3    ab        c    c  0.2923977     1
4    ab        d    d  1.4619883     2
5    ad        b    b  0.0000000     0
6    ad        d    c  0.2923977     1
7    ad        d    d  4.6783626     2
8    av        a    a  1.1695906     1

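2 answers:

Answer 0 (score: 1)

There is a mistake in your aggregate call. If you want to count the frequency of otu, you should put otu before the "~" sign. After that, you can merge the two results with the join function from the plyr library.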

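# Count the rows (OTUs) in each class/location/type group
abund_shared_freq<-aggregate(otu~class+location+type,datafr1,length)
library(plyr)
# abund_shared is the per-group abundance table; its definition is not shown in this post
join(abund_shared, abund_shared_freq, by=c("class", "location","type"), type="left")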

Output:

  class location type  abundance otu
1    ab        a    a  4.3859649   2
2    ab        b    b 87.7192982   2
3    ab        c    c  0.2923977   2
4    ab        d    d  1.4619883   2
5    ad        b    b  0.0000000   1
6    ad        d    c  0.2923977   1
7    ad        d    d  4.6783626   2
8    av        a    a  1.1695906   1
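
Note that `length` counts every row, including those with a value of 0. A hedged variant of the same `aggregate` call that counts only non-zero OTUs could look like this; the names `nonzero` and `otu_freq` are illustrative.

```r
# Dummy data from the question
datafr1 <- data.frame(
  class    = c("ab","ab","ad","ab","ab","ad","ab","ab","ad","ab","ad","ab","av"),
  otu      = c("ab","ac","ad","ab","ac","ad","ab","ac","ad","ab","ad","ac","av"),
  value    = c(0,1,12,13,300,1,2,3,4,0,0,2,4),
  type     = c("b","c","d","a","b","c","d","d","d","c","b","a","a"),
  location = c("b","c","d","a","b","d","d","d","d","c","b","a","a")
)

# Drop the zero-abundance rows before counting
nonzero <- subset(datafr1, value != 0)

# Number of remaining OTU rows per class/location/type group
otu_freq <- aggregate(otu ~ class + location + type, data = nonzero, FUN = length)
otu_freq
```

With the zeros removed, the ad/b/b group disappears entirely and ab/b/b drops from 2 to 1.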

Answer 1 (score: 1)

You can do this in a single step with data.table:

library(data.table)
val = sum(datafr1$value)
setDT(datafr1)[order(class,type), list(abundance = 
               sum(value)/val*100, otu.freq = .N), 
               by = .(class, location, type)]

Or using dplyr:

library(dplyr)
datafr1 %>% 
     group_by(class, location, type) %>% 
     summarise(abundance=sum(value)/sum(datafr1[['value']])*100, otu.freq=n())
 #   class location type  abundance otu.freq
 #1    ab        a    a  4.3859649        2
 #2    ab        b    b 87.7192982        2
 #3    ab        c    c  0.2923977        2
 #4    ab        d    d  1.4619883        2
 #5    ad        b    b  0.0000000        1
 #6    ad        d    c  0.2923977        1
 #7    ad        d    d  4.6783626        2
 #8    av        a    a  1.1695906        1

Update

Based on the new criteria, I am updating the code as suggested by the OP (@K.Brannen):

  datafr1 %>%
       group_by(class, location, type) %>% 
       summarise(abundance=sum(value)/sum(datafr1[['value']])*100, 
             otu.freq=sum(value !=0)) 

Update 2

Based on the updated expected result:

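  datafr1 %>%
       filter(value!=0) %>%                  # drop zero-abundance rows
       group_by(location, type) %>% 
       mutate(value1=sum(value)) %>%         # total value per location/type group
       group_by(class, add=TRUE) %>%         # in dplyr >= 1.0.0, use .add=TRUE instead of add=TRUE
       summarise(abundance=round(100*sum(value)/unique(value1)), 
                         otu.freq=n())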
  #    location type class abundance otu.freq
  #1        a    a    ab        79        2
  #2        a    a    av        21        1
  #3        b    b    ab       100        1
  #4        c    c    ab       100        1
  #5        d    c    ad       100        1
  #6        d    d    ab        24        2
  #7        d    d    ad        76        2
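
For readers without dplyr, the same steps can be sketched in base R. This is only a cross-check of the logic above, not part of the original answer; the names `nz` and `res` are illustrative.

```r
# Dummy data from the question
datafr1 <- data.frame(
  class    = c("ab","ab","ad","ab","ab","ad","ab","ab","ad","ab","ad","ab","av"),
  otu      = c("ab","ac","ad","ab","ac","ad","ab","ac","ad","ab","ad","ac","av"),
  value    = c(0,1,12,13,300,1,2,3,4,0,0,2,4),
  type     = c("b","c","d","a","b","c","d","d","d","c","b","a","a"),
  location = c("b","c","d","a","b","d","d","d","d","c","b","a","a")
)

# Step 1: drop zero-abundance rows
nz <- subset(datafr1, value != 0)

# Step 2: per class/location/type, sum the values and count the rows
nz$otu.freq <- 1
res <- aggregate(cbind(value, otu.freq) ~ class + location + type,
                 data = nz, FUN = sum)

# Step 3: express each class total as a rounded percentage of its
# location/type group total
res$abundance <- round(100 * res$value /
                       ave(res$value, res$location, res$type, FUN = sum))

res[, c("class", "location", "type", "abundance", "otu.freq")]
```

The row order may differ from the dplyr output, but the seven class/location/type combinations and their values should agree with it.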