I have a large dataset that looks like this:
my.df <- data.frame(Cond= rep(c("A", "B", "C", "D"), each = 4),
Gene = rep(c("Gene1", "Gene2", "Gene3", "Gene4"), 4),
Avg=sample(85:100, 16, replace = TRUE),
SE=sample(1:5, 16, replace = TRUE),
Val1=sample(1:50, 16),
Val2=sample(1:50, 16))
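Now, for each gene, I want to normalize everything: divide each value of "Avg", "SE", "Val1" and "Val2" by that gene's Cond A average (its Avg value in Cond A).
My current idea is to do something like this: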
by(my.df[ , 3:6], Gene, #since I want to do my calculation on each Gene
lapply(function(x) #since I want to do my calculation on each value
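but I don't know how to write the function so that it takes the current value x
and divides it by the Cond A Avg value for that Gene.
Alternatively, I thought of making another data frame containing the Gene and Cond A Avg values: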
CondAavg <- my.df[my.df$Cond == "A", c("Gene", "Avg")]
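and then trying to use sapply to apply a function to each value of "Gene", but I can't quite see how to make that work either.
I'm obviously very new to R, so any advice would be much appreciated.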
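Answer 0: (score: 0)
Note: this normalizes by the mean of all Cond=="A" rows rather than per gene. Leaving it here in case anyone cares about it. Thanks, Agstudy.
You could try: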
norm.vec <- colMeans(subset(my.df, Cond=="A")[-(1:2)])
my.df[-(1:2)] <- t(t(my.df[-(1:2)]) / norm.vec)
This takes advantage of recycling (but we need to transpose for it to work). The result of head(my.df):
# Cond Gene Avg SE Val1 Val2
# 1 A Gene1 0.9470752 0.6153846 0.89655172 1.6752137
# 2 A Gene2 1.0473538 1.2307692 1.41379310 0.5811966
# 3 A Gene3 1.0473538 1.5384615 0.44827586 1.6068376
# 4 A Gene4 0.9582173 0.6153846 1.24137931 0.1367521
# 5 B Gene1 1.0250696 0.3076923 0.06896552 0.6495726
# 6 B Gene2 0.9582173 1.2307692 0.41379310 0.4444444
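To normalize per gene, as the question asks, the same recycling idea can be adapted by looking up each row's Cond A average with match. This is only a minimal sketch, assuming exactly one Cond A row per gene; the fixed seed and the names ref, ref.avg, num.cols and norm.df are illustrative additions, not part of the original answer.

```r
set.seed(42)  # assumption: fixed seed so the example is reproducible
my.df <- data.frame(Cond = rep(c("A", "B", "C", "D"), each = 4),
                    Gene = rep(c("Gene1", "Gene2", "Gene3", "Gene4"), 4),
                    Avg  = sample(85:100, 16, replace = TRUE),
                    SE   = sample(1:5, 16, replace = TRUE),
                    Val1 = sample(1:50, 16),
                    Val2 = sample(1:50, 16))

# Reference table: one Cond A average per gene
ref <- my.df[my.df$Cond == "A", c("Gene", "Avg")]

# For every row, the Cond A average of that row's gene
ref.avg <- ref$Avg[match(my.df$Gene, ref$Gene)]

# A vector with one entry per row is recycled down each column,
# so no transpose is needed in this case
num.cols <- c("Avg", "SE", "Val1", "Val2")
norm.df <- my.df
norm.df[num.cols] <- my.df[num.cols] / ref.avg

head(norm.df)
```

With this version every Cond A row ends up with Avg equal to 1, which is a quick way to check that the lookup matched the right gene.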
Answer 1: (score: 0)
I would use merge:
dtm = merge(subset(my.df,Cond!='A'),
subset(my.df,Cond=='A',select=c('Gene','Avg')),by='Gene')
Gene Cond Avg.x SE Val1 Val2 Avg.y
1 Gene1 B 97 4 9 29 88
2 Gene1 C 97 5 30 21 88
3 Gene1 D 94 5 19 39 88
4 Gene2 B 88 2 13 20 97
5 Gene2 C 98 5 20 43 97
6 Gene2 D 95 4 39 2 97
7 Gene3 B 93 5 40 50 89
8 Gene3 C 92 5 43 30 89
9 Gene3 D 91 3 27 11 89
10 Gene4 B 87 2 49 49 98
11 Gene4 C 97 3 6 47 98
12 Gene4 D 88 3 33 44 98
Then I divide the numeric columns by the last column:
dtm[,c(3:6)] <- dtm[,c(3:6)]/dtm[,'Avg.y']
Gene Cond Avg.x SE Val1 Val2 Avg.y
1 Gene1 B 1.1022727 0.04545455 0.10227273 0.32954545 88
2 Gene1 C 1.1022727 0.05681818 0.34090909 0.23863636 88
3 Gene1 D 1.0681818 0.05681818 0.21590909 0.44318182 88
4 Gene2 B 0.9072165 0.02061856 0.13402062 0.20618557 97
5 Gene2 C 1.0103093 0.05154639 0.20618557 0.44329897 97
6 Gene2 D 0.9793814 0.04123711 0.40206186 0.02061856 97
7 Gene3 B 1.0449438 0.05617978 0.44943820 0.56179775 89
8 Gene3 C 1.0337079 0.05617978 0.48314607 0.33707865 89
9 Gene3 D 1.0224719 0.03370787 0.30337079 0.12359551 89
10 Gene4 B 0.8877551 0.02040816 0.50000000 0.50000000 98
11 Gene4 C 0.9897959 0.03061224 0.06122449 0.47959184 98
12 Gene4 D 0.8979592 0.03061224 0.33673469 0.44897959 98
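It is better to use grepl to avoid numeric indices (note that this selection also includes Avg.y, which is divided by itself and becomes 1):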
dtm[, !grepl('Gene|Cond',names(dtm))] =
dtm[, !grepl('Gene|Cond',names(dtm))] /dtm[,'Avg.y']
> dtm
Gene Cond Avg.x SE Val1 Val2 Avg.y
1 Gene1 B 1.1022727 0.04545455 0.10227273 0.32954545 1
2 Gene1 C 1.1022727 0.05681818 0.34090909 0.23863636 1
3 Gene1 D 1.0681818 0.05681818 0.21590909 0.44318182 1
4 Gene2 B 0.9072165 0.02061856 0.13402062 0.20618557 1
5 Gene2 C 1.0103093 0.05154639 0.20618557 0.44329897 1
6 Gene2 D 0.9793814 0.04123711 0.40206186 0.02061856 1
7 Gene3 B 1.0449438 0.05617978 0.44943820 0.56179775 1
8 Gene3 C 1.0337079 0.05617978 0.48314607 0.33707865 1
9 Gene3 D 1.0224719 0.03370787 0.30337079 0.12359551 1
10 Gene4 B 0.8877551 0.02040816 0.50000000 0.50000000 1
11 Gene4 C 0.9897959 0.03061224 0.06122449 0.47959184 1
12 Gene4 D 0.8979592 0.03061224 0.33673469 0.44897959 1
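The merge above drops the Cond A rows, and the grepl selection also rescales the helper column. If both are unwanted, one possible variation is to merge against the full data frame and name the reference column up front. This is a sketch under the assumption of one Cond A row per gene; the column name RefAvg and the fixed seed are hypothetical choices made for illustration.

```r
set.seed(42)  # assumption: fixed seed so the example is reproducible
my.df <- data.frame(Cond = rep(c("A", "B", "C", "D"), each = 4),
                    Gene = rep(c("Gene1", "Gene2", "Gene3", "Gene4"), 4),
                    Avg  = sample(85:100, 16, replace = TRUE),
                    SE   = sample(1:5, 16, replace = TRUE),
                    Val1 = sample(1:50, 16),
                    Val2 = sample(1:50, 16))

# Rename the reference column first so merge adds no .x/.y suffixes
ref <- setNames(my.df[my.df$Cond == "A", c("Gene", "Avg")],
                c("Gene", "RefAvg"))

# Merging against the full data keeps the Cond A rows as well
dtm <- merge(my.df, ref, by = "Gene")

# Pick the columns to normalize by name, leaving RefAvg untouched
num.cols <- setdiff(names(dtm), c("Gene", "Cond", "RefAvg"))
dtm[num.cols] <- dtm[num.cols] / dtm$RefAvg

dtm$RefAvg <- NULL  # drop the helper column once it is no longer needed
dtm
```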
Answer 2: (score: 0)
Here is how I would do it with the plyr package:
library("plyr")
ddply(my.df, .(Gene), transform,
Avg.norm = Avg / Avg[Cond=="A"],
SE.norm = SE / SE[Cond=="A"],
Val1.norm = Val1 / Val1[Cond=="A"],
Val2.norm = Val2 / Val2[Cond=="A"])
I put the normalized values in new columns, but you could easily overwrite the existing ones.
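To overwrite the existing columns, and to divide every column by the gene's Cond A Avg as the question describes (the call above divides each column by its own Cond A value), one option is to pass ddply a function instead of transform. This is a sketch assuming one Cond A row per gene; norm.cols, ref and norm.df are illustrative names, and the fixed seed is only there for reproducibility.

```r
library("plyr")

set.seed(42)  # assumption: fixed seed so the example is reproducible
my.df <- data.frame(Cond = rep(c("A", "B", "C", "D"), each = 4),
                    Gene = rep(c("Gene1", "Gene2", "Gene3", "Gene4"), 4),
                    Avg  = sample(85:100, 16, replace = TRUE),
                    SE   = sample(1:5, 16, replace = TRUE),
                    Val1 = sample(1:50, 16),
                    Val2 = sample(1:50, 16))

norm.cols <- c("Avg", "SE", "Val1", "Val2")

# ddply splits my.df by Gene, applies the function to each piece,
# and binds the results back into one data frame
norm.df <- ddply(my.df, .(Gene), function(d) {
  ref <- d$Avg[d$Cond == "A"]          # this gene's Cond A average
  d[norm.cols] <- d[norm.cols] / ref   # overwrite the columns in place
  d
})

norm.df
```

The result comes back ordered by Gene rather than in the original row order.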