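I'd like to normalise several variables, each to its own mean in the control group.
Say I have a data frame in which two variables (Score1 and Score2) are measured under three different conditions (Control, Drug 1, Drug 2).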
df <- data.frame(Treatment=rep(c( "Control", "Drug 1",
"Drug 2"), each=6 ),
Score1=c(4,5,4,5,5,6,8,9,10,8,9,9,14,15,13,15,14,15),
Score2=c(1,2,1,2,3,3,8,8,9,9,8,8,14,14,15,12,14,15))
df
Treatment Score1 Score2
1 Control 4 1
2 Control 5 2
3 Control 4 1
4 Control 5 2
5 Control 5 3
6 Control 6 3
7 Drug 1 8 8
8 Drug 1 9 8
9 Drug 1 10 9
10 Drug 1 8 9
11 Drug 1 9 8
12 Drug 1 9 8
13 Drug 2 14 14
14 Drug 2 15 14
15 Drug 2 13 15
16 Drug 2 15 12
17 Drug 2 14 14
18 Drug 2 15 15
I want to normalise each score to the control group's mean for that score. The end result would be:
df.normal <- df
x <- mean(df$Score1[df$Treatment=="Control"])
y <- mean(df$Score2[df$Treatment=="Control"])
df.normal$Score1_normalised <- df$Score1 / x
df.normal$Score2_normalised <- df$Score2 / y
df.normal
Treatment Score1 Score2 Score1_normalised Score2_normalised
1 Control 4 1 0.8275862 0.5
2 Control 5 2 1.0344828 1.0
3 Control 4 1 0.8275862 0.5
4 Control 5 2 1.0344828 1.0
5 Control 5 3 1.0344828 1.5
6 Control 6 3 1.2413793 1.5
7 Drug 1 8 8 1.6551724 4.0
8 Drug 1 9 8 1.8620690 4.0
9 Drug 1 10 9 2.0689655 4.5
10 Drug 1 8 9 1.6551724 4.5
11 Drug 1 9 8 1.8620690 4.0
12 Drug 1 9 8 1.8620690 4.0
13 Drug 2 14 14 2.8965517 7.0
14 Drug 2 15 14 3.1034483 7.0
15 Drug 2 13 15 2.6896552 7.5
16 Drug 2 15 12 3.1034483 6.0
17 Drug 2 14 14 2.8965517 7.0
18 Drug 2 15 15 3.1034483 7.5
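I think dplyr can do this, but I'm struggling to get started. Since I have 20 variables, I'm hoping there is a shortcut rather than doing it the long way.
Any help would be much appreciated!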
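Answer 0 (score: 0)
Here is a dplyr + tidyr workflow. It scales well, but unfortunately it gets a bit complicated once you need to do some reshaping.
With a few basic dplyr verbs, you can take the control rows and compute the mean of every column whose name starts with "Score". Since the resulting data frame has only one row, you can easily use those means when normalising df.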
library(dplyr)
library(tidyr)  # provides gather(), spread() and unite(), used further down
control_means <- df %>%
filter(Treatment == "Control") %>%
summarise_at(vars(starts_with("Score")), mean)
df %>%
mutate(Score1_norm = Score1 / control_means$Score1,
Score2_norm = Score2 / control_means$Score2) %>%
head()
#> Treatment Score1 Score2 Score1_norm Score2_norm
#> 1 Control 4 1 0.8275862 0.5
#> 2 Control 5 2 1.0344828 1.0
#> 3 Control 4 1 0.8275862 0.5
#> 4 Control 5 2 1.0344828 1.0
#> 5 Control 5 3 1.0344828 1.5
#> 6 Control 6 3 1.2413793 1.5
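However, copying this out for many more score columns would get tedious quickly. Normally you could use mutate_at to cut down on the repetition, but I don't think that works well here, because each score column needs a different column of control_means.
Instead, you can reshape both the means and the data into long format and then join them on the score group (Score1, Score2, and so on; I don't know what else you might call them):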
control_means_long <- control_means %>%
gather(key = group, value = mean_score)
control_means_long
#> group mean_score
#> 1 Score1 4.833333
#> 2 Score2 2.000000
df %>%
gather(key = group, value = score, starts_with("Score")) %>%
left_join(control_means_long, by = "group") %>%
mutate(score_norm = score / mean_score) %>%
head()
#> Treatment group score mean_score score_norm
#> 1 Control Score1 4 4.833333 0.8275862
#> 2 Control Score1 5 4.833333 1.0344828
#> 3 Control Score1 4 4.833333 0.8275862
#> 4 Control Score1 5 4.833333 1.0344828
#> 5 Control Score1 5 4.833333 1.0344828
#> 6 Control Score1 6 4.833333 1.2413793
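You may want to drop the mean column after this. If you can leave the data in this long format, you're done. But if you need to get back to the wide shape you started with, it takes a couple of rounds of reshaping.
After the calculation, I gather a second time to create a column score_type that marks whether a value is measured or normed. That text is then pasted together with the group to form columns such as Score1_measured and Score1_normed. A temporary row number is added so that spread matches the scores up correctly, and then the data is spread back to wide format: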
df %>%
gather(key = group, value = measured, starts_with("Score")) %>%
left_join(control_means_long, by = "group") %>%
mutate(normed = measured / mean_score) %>%
select(-mean_score) %>%
gather(key = score_type, value = value, measured, normed) %>%
unite(group_and_type, group, score_type) %>%
group_by(group_and_type) %>%
mutate(row = row_number()) %>%
spread(key = group_and_type, value = value) %>%
select(-row) %>%
head()
#> # A tibble: 6 x 5
#> Treatment Score1_measured Score1_normed Score2_measured Score2_normed
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 Control 4 0.828 1 0.5
#> 2 Control 5 1.03 2 1
#> 3 Control 4 0.828 1 0.5
#> 4 Control 5 1.03 2 1
#> 5 Control 5 1.03 3 1.5
#> 6 Control 6 1.24 3 1.5
Created on 2019-02-19 by the reprex package (v0.2.1)
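As a possible shortcut for anyone on a newer dplyr: the sketch below assumes dplyr 1.0.0 or later, which the workflow above does not require, and uses across() to divide every column starting with "Score" by the mean of its own Control rows. That may avoid the reshaping altogether. The `_norm` suffix and the name `df_norm` are just illustrative choices.

```r
library(dplyr)

# Rebuild the example data from the question
df <- data.frame(
  Treatment = rep(c("Control", "Drug 1", "Drug 2"), each = 6),
  Score1 = c(4, 5, 4, 5, 5, 6, 8, 9, 10, 8, 9, 9, 14, 15, 13, 15, 14, 15),
  Score2 = c(1, 2, 1, 2, 3, 3, 8, 8, 9, 9, 8, 8, 14, 14, 15, 12, 14, 15)
)

# For each Score column, divide by the mean of that column's Control rows
# and store the result in a new column named <column>_norm
df_norm <- df %>%
  mutate(across(
    starts_with("Score"),
    ~ .x / mean(.x[Treatment == "Control"]),
    .names = "{.col}_norm"
  ))

head(df_norm)
```

If the columns do not share a prefix, the selection could presumably be swapped for something like `where(is.numeric)` or an explicit vector of names.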
Answer 1 (score: 0)
Here is one way with aggregate() and mapply():
> Medias <- aggregate(df[c("Score1", "Score2")], list(df$Treatment), mean)
> Medias
Group.1 Score1 Score2
1 Control 4.833333 2.000000
2 Drug 1 8.833333 8.333333
3 Drug 2 14.333333 14.000000
>
> mapply( function(x, y) {x / y}, x = df[c("Score1", "Score2")], y = Medias[Medias$Group.1 == "Control" , c("Score1", "Score2")])
Score1 Score2
[1,] 0.8275862 0.5
[2,] 1.0344828 1.0
[3,] 0.8275862 0.5
[4,] 1.0344828 1.0
[5,] 1.0344828 1.5
[6,] 1.2413793 1.5
[7,] 1.6551724 4.0
[8,] 1.8620690 4.0
[9,] 2.0689655 4.5
[10,] 1.6551724 4.5
[11,] 1.8620690 4.0
[12,] 1.8620690 4.0
[13,] 2.8965517 7.0
[14,] 3.1034483 7.0
[15,] 2.6896552 7.5
[16,] 3.1034483 6.0
[17,] 2.8965517 7.0
[18,] 3.1034483 7.5
>
Hope that helps.
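One thing to keep in mind is that mapply() returns a plain matrix here, not a data frame with the original columns. A minimal base R sketch of attaching the result back to df, as in the question's desired output, might look like the following. The names `score_cols`, `control_means` and `normalised` are made up for illustration, and colMeans() is used instead of aggregate() only because the control means are all that is needed.

```r
# Rebuild the example data from the question
df <- data.frame(
  Treatment = rep(c("Control", "Drug 1", "Drug 2"), each = 6),
  Score1 = c(4, 5, 4, 5, 5, 6, 8, 9, 10, 8, 9, 9, 14, 15, 13, 15, 14, 15),
  Score2 = c(1, 2, 1, 2, 3, 3, 8, 8, 9, 9, 8, 8, 14, 14, 15, 12, 14, 15)
)

score_cols <- c("Score1", "Score2")

# Named vector holding the control-group mean of each score column
control_means <- colMeans(df[df$Treatment == "Control", score_cols])

# Divide each column by its control mean, then convert the matrix to a data frame
normalised <- as.data.frame(
  mapply(function(x, y) x / y, df[score_cols], control_means)
)
names(normalised) <- paste0(score_cols, "_normalised")

# Original columns plus the normalised ones
df.normal <- cbind(df, normalised)
head(df.normal)
```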
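Answer 2 (score: 0)
Thanks very much for the suggestions! I should have made it clearer that the variables I called "Score1 and Score2" here actually have a bunch of different names in my data set, such as area, number, length, and so on.
What ended up working for me was a combination of dplyr and mapply. Thanks for the helpful dplyr tips, though, Camille!
I got the mean of every variable (grouped by treatment) like this: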
Means<- df %>% group_by(Treatment) %>%
summarise_each(funs(mean(., na.rm = TRUE)))
Then I used mapply to normalise each variable by its mean in the Control treatment:
normalised.df <-mapply( function(x,y) {x / y},
x = df[c("area", "number", "length")],
y = Means[Means$Treatment == "Control", c("area", "number", "length")])
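Note that summarise_each() and funs() have since been deprecated in dplyr. A rough equivalent of the two steps above, assuming dplyr 1.0.0 or later, is sketched here. The small data frame is invented purely so the example runs on its own (the real columns area, number and length are not shown in this thread), and wrapping the mapply() result in as.data.frame() is an optional extra because mapply() returns a matrix.

```r
library(dplyr)

# Made-up stand-in data; only the column names come from the post above
df <- data.frame(
  Treatment = rep(c("Control", "Drug 1"), each = 3),
  area   = c(1, 2, 3, 4, 5, 6),
  number = c(2, 4, 6, 8, 10, 12),
  length = c(1, 1, 1, 2, 2, 2)
)
vars <- c("area", "number", "length")

# Mean of every non-grouping column per treatment (across() replaces summarise_each())
Means <- df %>%
  group_by(Treatment) %>%
  summarise(across(everything(), ~ mean(.x, na.rm = TRUE)))

# Divide each variable by its Control mean and keep the result as a data frame
normalised.df <- as.data.frame(
  mapply(function(x, y) x / y,
         x = df[vars],
         y = Means[Means$Treatment == "Control", vars])
)
normalised.df
```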
Thanks very much!