I have a dataset that looks like this:
sample_id target_id tpm
1 SRR3884840X (A)n 0.00000
2 SRR3884840X (AC)n 0.00000
3 SRR3884840X (AG)n 0.00000
4 SRR3884840X (AT)n 15.54990
5 SRR3884840X (C)n 3.10997
6 SRR3884840X (CAA)n 6.21995
27357 SRR3884841X (A)n 0.00000e+00
27358 SRR3884841X (AC)n 0.00000e+00
27359 SRR3884841X (AG)n 0.00000e+00
27360 SRR3884841X (AT)n 0.00000e+00
27361 SRR3884841X (C)n 1.76941e+01
27362 SRR3884841X (CAA)n 3.53882e+01
1094236 SRR3884878C comp78901_c0_seq3_1 916
1094237 SRR3884878C comp85230_c0_seq1_1 1002
1094238 SRR3884878C comp56944_c0_seq1_1 2285
example data:
> dput(droplevels(head(teX,15)))
structure(list(sample_id = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "SRR3884840X", class = "factor"),
target_id = structure(1:15, .Label = c("(A)n", "(AC)n", "(AG)n",
"(AT)n", "(C)n", "(CAA)n", "(CAAA)n", "(CAAAA)n", "(CAAAAA)n",
"(CCA)n", "(CCCCA)n", "(CCCCAA)n", "(CCCCCA)n", "(CCCGAA)n",
"(CCCTAA)n"), class = "factor"), tpm = c(0, 0, 0, 15.5499,
3.10997, 6.21995, 43.5396, 9.32992, 3.10997, 0, 0, 0, 0,
0, 0)), .Names = c("sample_id", "target_id", "tpm"), row.names = c(NA,
15L), class = "data.frame")
I want to compute the mean tpm for each target_id (~12,000 target_ids across 10 sample_ids), so I used tapply:
teXmean <- tapply(teX$target_id , teX$tpm, mean)
This returns an array like the following:
0 2.25378e-10 3.21558e-10 4.49778e-10 4.5942e-10 4.63221e-10 5.5716e-10 5.63276e-10 5.90753e-10 6.20478e-10
NA NA NA NA NA NA NA NA NA NA
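Is there a way to recover the target_id for each mean, or am I going about this completely the wrong way? The end goal is to compute the mean for each target_id, drop the entries whose target_id contains the string "comp", and then generate a heatmap with heatmap.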
Answer 0 (score: 0)
You can also use dplyr:
tab <- structure(list(sample_id = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "SRR3884840X", class ="factor"),
target_id = structure(1:15, .Label = c("(A)n", "(AC)n", "(AG)n",
"(AT)n", "(C)n", "(CAA)n", "(CAAA)n", "(CAAAA)n", "(CAAAAA)n",
"(CCA)n", "(CCCCA)n", "(CCCCAA)n", "(CCCCCA)n", "(CCCGAA)n",
"(CCCTAA)n"), class = "factor"),
tpm = c(0, 0, 0, 15.5499, 3.10997, 6.21995, 43.5396, 9.32992, 3.10997, 0, 0, 0, 0, 0, 0)),
.Names = c("sample_id", "target_id", "tpm"),
row.names = c(NA, 15L),
class = "data.frame")
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
tab %>%
group_by(target_id) %>%
summarise(mean_tpm = mean(tpm))
#> # A tibble: 15 x 2
#> target_id mean_tpm
#> <fctr> <dbl>
#> 1 (A)n 0.00000
#> 2 (AC)n 0.00000
#> 3 (AG)n 0.00000
#> 4 (AT)n 15.54990
#> 5 (C)n 3.10997
#> 6 (CAA)n 6.21995
#> 7 (CAAA)n 43.53960
#> 8 (CAAAA)n 9.32992
#> 9 (CAAAAA)n 3.10997
#> 10 (CCA)n 0.00000
#> 11 (CCCCA)n 0.00000
#> 12 (CCCCAA)n 0.00000
#> 13 (CCCCCA)n 0.00000
#> 14 (CCCGAA)n 0.00000
#> 15 (CCCTAA)n 0.00000
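As for why the call in the question returns NA: tapply takes its arguments as tapply(X, INDEX, FUN), with the values first and the grouping factor second. In the question they are swapped, so the rows are grouped by tpm value and mean() is applied to the target_id factor, which yields NA. Below is a minimal base R sketch of the whole pipeline (mean per target_id, dropping "comp" entries, building a matrix for heatmap). The toy data frame and its values are made up for illustration, and the names keep, sub and mat are hypothetical; only teX, teXmean and the column names come from the question.

```r
# Toy stand-in for teX; the values are invented for illustration only
teX <- data.frame(
  sample_id = rep(c("S1", "S2"), each = 3),
  target_id = factor(rep(c("(A)n", "(AT)n", "comp1_c0_seq1"), times = 2)),
  tpm       = c(0, 10, 5, 2, 20, 7)
)

# tapply(X, INDEX, FUN): the values come first, the grouping factor second
teXmean <- tapply(teX$tpm, teX$target_id, mean)

# The target_ids are kept as the names of the result, so "comp" entries
# can be dropped by matching on those names
teXmean <- teXmean[!grepl("comp", names(teXmean), fixed = TRUE)]
print(teXmean)

# heatmap() expects a numeric matrix; here rows are target_ids and
# columns are sample_ids (assumes one tpm value per target/sample pair)
keep <- !grepl("comp", teX$target_id, fixed = TRUE)
sub  <- droplevels(teX[keep, ])
mat  <- tapply(sub$tpm, list(sub$target_id, sub$sample_id), sum)
print(mat)
```

Passing mat to heatmap() should then work, since it is a numeric matrix with at least two rows and two columns. Whether the heatmap should show per-sample values (as in mat) or only the per-target means depends on what you want to compare; the per-target means alone form a single column, which heatmap() will not accept.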