Average duplicate rows only if the values in a given column all have the same sign (conditional)

Time: 2018-04-25 18:25:56

Tags: r duplicates average rows plyr

My data looks like this:

gene_id      logFC          logCPM      LR          PValue      FDR
FBgn0000422 -1.875410209    4.429477429 25.16243497 5.27E-07    9.46E-05
FBgn0000422  1.262578335    4.429477429 11.65196417 0.000641348 0.022693702
FBgn0000422 -1.55793362     4.429477429 18.01707407 2.19E-05    0.00235694
FBgn0000565 -1.225082505    6.984450503 22.91546921 1.69E-06    0.000232455
FBgn0000565 -0.989958212    6.984450503 15.45759475 8.44E-05    0.006343374
FBgn0000565 -0.947467121    6.984450503 14.06298678 0.000176789 0.010290503
FBgn0001257 -1.135767061    6.745553159 33.67172953 6.52E-09    2.83E-06
FBgn0001257 -0.806003432    6.745553159 17.36036853 3.09E-05    0.003015214
FBgn0001257 -0.90371115     6.745553159 21.8449115  2.96E-06    0.000523406
FBgn0001291 -0.850144165    5.096971424 42.18504599 8.30E-11    8.08E-08
FBgn0001291 -0.892576562    5.096971424 47.27263627 6.18E-12    2.08E-08
FBgn0001291 -0.629617901    5.096971424 24.12565834 9.02E-07    0.000195886
FBgn0001301 -0.72615833     3.849906562 20.61723199 5.61E-06    0.000634277
FBgn0001301 -0.647614044    3.849906562 16.55276488 4.73E-05    0.004244782
FBgn0001301 -0.700985769    3.849906562 19.62582463 9.42E-06    0.001242629
FBgn0002719  0.39714033     8.153175244 9.467307643 0.002091661 0.045180557
FBgn0002719 -0.566665823    8.153175244 19.77575512 8.71E-06    0.001137708
FBgn0002719  0.509820318    8.153175244 15.96243465 6.46E-05    0.005084696

Each gene_id has 3 duplicate rows, and I want to average them. I can do this with plyr using the following code:

AvL_univ_DOD_AVG<-ddply(AvL_univ_DOD,.(gene_id),colwise(mean,c("logFC","logCPM","LR","PValue","FDR")))

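However, what I really want is to average the rows for a gene_id only if the three "logFC" values within that gene_id have the same sign (all negative or all positive).

I don't need to keep the gene_ids that fail this criterion.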

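1 Answer:

Answer 0: (score: 1)

How about filtering out, before using plyr, the rows whose gene_id has neither all negative nor all positive values in the logFC column? E.g. with data.table: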

library(data.table)
library(plyr)  # provides ddply() and colwise(), used in the last step

AvL_univ_DOD <- data.table(AvL_univ_DOD)
AvL_univ_DOD[,sign:=logFC>0]

#count how many duplicates you have for each gene_id
AvL_univ_DOD[,number_of_duplicates:=.N,by=gene_id] 

#count how many positives you have for each gene_id
AvL_univ_DOD[,number_of_pos:=sum(sign),by=gene_id] 

# keep only cases where you have all positives or all negatives
AvL_univ_DOD2 <- AvL_univ_DOD[number_of_pos==0|number_of_pos==number_of_duplicates] 

# apply plyr
AvL_univ_DOD_AVG<-ddply(AvL_univ_DOD2,.(gene_id),colwise(mean,c("logFC","logCPM","LR","PValue","FDR")))
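For comparison, here is a minimal sketch of the same filter-then-average idea in base R only, using ave() and aggregate() rather than data.table and plyr. It is not the answerer's method. The small data frame copies two gene_ids and two of the numeric columns from the question so the snippet runs on its own, and the helper vector same_sign is a name made up for this illustration. One assumption differs from the code above: a logFC of exactly 0 counts as neither positive nor negative here, whereas logFC>0 groups it with the negatives.

```r
# Subset of the question's data: two gene_ids, three replicates each
AvL_univ_DOD <- data.frame(
  gene_id = rep(c("FBgn0000422", "FBgn0000565"), each = 3),
  logFC   = c(-1.875410209, 1.262578335, -1.55793362,
              -1.225082505, -0.989958212, -0.947467121),
  logCPM  = rep(c(4.429477429, 6.984450503), each = 3),
  stringsAsFactors = FALSE
)

# same_sign (hypothetical helper): TRUE for every row whose gene_id has
# logFC values that are all > 0 or all < 0. ave() recycles the per-group
# result across the rows of that group and returns it as 0/1.
same_sign <- as.logical(ave(
  AvL_univ_DOD$logFC, AvL_univ_DOD$gene_id,
  FUN = function(x) all(x > 0) || all(x < 0)
))

# Average the numeric columns within each gene_id that passed the filter
AvL_univ_DOD_AVG <- aggregate(
  cbind(logFC, logCPM) ~ gene_id,
  data = AvL_univ_DOD[same_sign, ],
  FUN  = mean
)
print(AvL_univ_DOD_AVG)
```

In this subset FBgn0000422 is dropped because its logFC values have mixed signs, and FBgn0000565 is kept and averaged. On the full table from the question, FBgn0002719 would be dropped for the same reason. To average all five columns, list them inside cbind().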
