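Say I have variables measured at multiple time points, and I want to apply the same operations at every time point. How can I do this more efficiently than handling each time point separately? In the example below, I want to 1) get the total of selected columns at each time point, and 2) for each variable, see the change from baseline to every later time point.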
#fake data to show what the dataset I receive looks like:
library(reshape2)
id=rep(c(1,1,1,2,2,2,3,3,3), 3)
time=c(rep("Time1",9), rep("Time2",9), rep("Time3",9))
test=rep(c("calcium","magnesium","zinc"), 9)
score=rnorm(n = 27, mean = 10, sd = 3)
fake <- data.frame(id, time, test, score)
fake <- dcast(fake, id ~ time + test, value.var = "score")
#Task 1- Get total of selected columns at each time point
#Non-efficient method:
fake$totalmgca1 <- rowSums(fake[,c("Time1_calcium", "Time1_magnesium")])
fake$totalmgca2 <- rowSums(fake[,c("Time2_calcium", "Time2_magnesium")])
fake$totalmgca3 <- rowSums(fake[,c("Time3_calcium", "Time3_magnesium")])
#Task 2 - Get change in calcium levels from baseline to each day
#Non-efficient method:
fake$calciumt1t2 <- fake$Time2_calcium - fake$Time1_calcium
fake$calciumt1t3 <- fake$Time3_calcium - fake$Time1_calcium
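Any tips on how to do the above in fewer lines? Is there a way to use group_by(), or do I need to build lists and use lapply()?
Answer 0 (score: 1)
For me, a good start would be to keep the raw data in a long, tidy format, for example: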
library(tidyverse)
id <- c(rep(1,3), rep(2,3), rep(3,3))
set.seed(1) # for reproducible sample values
value <- rnorm(9)
param <- c(rep("calcium", 3), rep("magnesium", 3), rep("zinc", 3))
time <- rep(c(1,2,3), 3)
df <- data.frame(id, value, param, time)
as_tibble(df) #convenient way to see the data
# A tibble: 9 x 4
# id value param time
# <dbl> <dbl> <fct> <dbl>
#1 1 -0.626 calcium 1
#2 1 0.184 calcium 2
#3 1 -0.836 calcium 3
#4 2 1.60 magnesium 1
#5 2 0.330 magnesium 2
#6 2 -0.820 magnesium 3
#7 3 0.487 zinc 1
#8 3 0.738 zinc 2
#9 3 0.576 zinc 3
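Then, if fewer lines is what you are after, you can define a function in a separate file (for example, difference_from_baseline() in function_defs.r). Once you have found the right function for the math, the main working file needs only a single line such as:
operated_on_desired_data <- difference_from_baseline(df)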
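The answer names difference_from_baseline() without showing a body. As a minimal sketch only, here is one way such a helper might look, assuming the long-format columns id, value, param, and time from the sample data and taking the earliest time point of each id/param series as the baseline. The column name change_from_baseline is made up for illustration.

```r
library(dplyr)

# Hypothetical helper: the name comes from the answer, the body is an assumption.
# For each id/param series, subtract the value at the earliest time point.
difference_from_baseline <- function(data) {
  data %>%
    arrange(id, param, time) %>%
    group_by(id, param) %>%
    mutate(change_from_baseline = value - first(value)) %>%
    ungroup()
}

# Same sample data as in the answer above
set.seed(1)
df <- data.frame(
  id    = rep(1:3, each = 3),
  value = rnorm(9),
  param = rep(c("calcium", "magnesium", "zinc"), each = 3),
  time  = rep(1:3, times = 3)
)

operated_on_desired_data <- difference_from_baseline(df)
operated_on_desired_data
```

If the function lives in function_defs.r, the main file would presumably load it with source("function_defs.r") before the call.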
Answer 1 (score: 1)
You might first consider keeping the data in long format; that is, stop at:
fake <- data.frame(id, time, test, score)
and skip the dcast step.
Now you can use dplyr functions.
library(dplyr)
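For example, to add a column with the change in every test relative to the previous time point (for Time2 this equals the change from baseline):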
fake %>%
arrange(time) %>%
group_by(id, test) %>%
mutate(test_diff = score - lag(score))
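Note that lag() compares each row with the one before it, so from Time3 onward the result is the change since the previous time point rather than since baseline. If the change from baseline is what is needed, one possible variant is to subtract first(score) within each group instead. The sketch below rebuilds the question's long data (set.seed is added here only for reproducibility, and the column names diff_prev and diff_baseline are illustrative) and computes both side by side.

```r
library(dplyr)

# Long-format data built as in the question (seed added for reproducibility)
set.seed(42)
fake <- data.frame(
  id    = rep(c(1, 1, 1, 2, 2, 2, 3, 3, 3), 3),
  time  = rep(c("Time1", "Time2", "Time3"), each = 9),
  test  = rep(c("calcium", "magnesium", "zinc"), 9),
  score = rnorm(n = 27, mean = 10, sd = 3)
)

baseline_change <- fake %>%
  arrange(time) %>%                 # Time1 comes first within every group
  group_by(id, test) %>%
  mutate(
    diff_prev     = score - lag(score),   # change from the previous time point
    diff_baseline = score - first(score)  # change from Time1 (baseline)
  ) %>%
  ungroup()

baseline_change
```

This relies on the time labels sorting in chronological order, which holds for "Time1" to "Time3" but would need a numeric or factor time column once labels such as "Time10" appear.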
To add a column with the sum of calcium and magnesium at each time point:
fake %>%
group_by(id, time) %>%
filter(test != "zinc") %>%
summarise(total_mgca = sum(score)) %>%
right_join(fake)
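As an alternative sketch, the same total can be attached without the filter/summarise/join round trip by computing it inside a grouped mutate(). This is a variation on the answer's approach, not the answer's own code, and it rebuilds the question's long data with a seed added for reproducibility.

```r
library(dplyr)

# Long-format data built as in the question (seed added for reproducibility)
set.seed(42)
fake <- data.frame(
  id    = rep(c(1, 1, 1, 2, 2, 2, 3, 3, 3), 3),
  time  = rep(c("Time1", "Time2", "Time3"), each = 9),
  test  = rep(c("calcium", "magnesium", "zinc"), 9),
  score = rnorm(n = 27, mean = 10, sd = 3)
)

with_totals <- fake %>%
  group_by(id, time) %>%
  # Sum only the calcium and magnesium rows, but write the total on every row
  mutate(total_mgca = sum(score[test %in% c("calcium", "magnesium")])) %>%
  ungroup()

with_totals
```

Because no rows are dropped, the zinc rows keep their place and no join back to the original data is needed.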
Putting it together:
fake %>%
group_by(id, time) %>%
filter(test != "zinc") %>%
summarise(total_mgca = sum(score)) %>%
ungroup() %>%
right_join(fake) %>%
arrange(time) %>%
group_by(id, test) %>%
mutate(test_diff = score - lag(score)) %>%
ungroup()
Result:
id time total_mgca test score test_diff
1 1 Time1 21.64788 calcium 12.296461 NA
2 1 Time1 21.64788 magnesium 9.351419 NA
3 1 Time1 21.64788 zinc 6.897300 NA
4 2 Time1 25.16516 calcium 11.026712 NA
5 2 Time1 25.16516 magnesium 14.138449 NA
6 2 Time1 25.16516 zinc 4.462579 NA
7 3 Time1 15.39817 calcium 5.778935 NA
8 3 Time1 15.39817 magnesium 9.619240 NA
9 3 Time1 15.39817 zinc 4.976049 NA
10 1 Time2 29.97949 calcium 11.152820 -1.1436409
11 1 Time2 29.97949 magnesium 18.826667 9.4752480
12 1 Time2 29.97949 zinc 8.280754 1.3834534
13 2 Time2 32.65905 calcium 16.469051 5.4423387
14 2 Time2 32.65905 magnesium 16.190000 2.0515508
15 2 Time2 32.65905 zinc 10.781192 6.3186129
16 3 Time2 14.24311 calcium 3.843355 -1.9355800
17 3 Time2 14.24311 magnesium 10.399755 0.7805155
18 3 Time2 14.24311 zinc 7.868311 2.8922628
19 1 Time3 23.26662 calcium 9.325816 -1.8270041
20 1 Time3 23.26662 magnesium 13.940803 -4.8858643
21 1 Time3 23.26662 zinc 13.984667 5.7039133
22 2 Time3 16.67828 calcium 5.142377 -11.3266742
23 2 Time3 16.67828 magnesium 11.535903 -4.6540968
24 2 Time3 16.67828 zinc 13.057014 2.2758226
25 3 Time3 25.09958 calcium 14.158592 10.3152371
26 3 Time3 25.09958 magnesium 10.940988 0.5412329
27 3 Time3 25.09958 zinc 11.229914 3.3616030
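If the data has to stay in the wide layout from the question, a loop over the time labels can replace the repeated lines. The following is a base R sketch under stated assumptions: the wide data frame is a stand-in for the dcast() output (columns named like Time1_calcium), and the new column names total_mgca_* and calcium_Time1_* are made up for illustration.

```r
set.seed(42)
times <- c("Time1", "Time2", "Time3")
tests <- c("calcium", "magnesium", "zinc")

# Stand-in for the dcast() output in the question: one row per id,
# one column per time/test combination (e.g. Time1_calcium)
wide <- data.frame(id = 1:3)
for (col in as.vector(outer(times, tests, paste, sep = "_"))) {
  wide[[col]] <- rnorm(3, mean = 10, sd = 3)
}

# Task 1: calcium + magnesium total at every time point
for (tp in times) {
  wide[[paste0("total_mgca_", tp)]] <-
    rowSums(wide[, paste0(tp, c("_calcium", "_magnesium"))])
}

# Task 2: change in calcium from baseline (Time1) to each later time point
for (tp in times[-1]) {
  wide[[paste0("calcium_Time1_", tp)]] <-
    wide[[paste0(tp, "_calcium")]] - wide$Time1_calcium
}

names(wide)
```

Adding a fourth time point then only means extending the times vector, although the long-format approach above scales the same way without building column names by hand.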