I have a data frame with more than 200 variables (a sample is shown below):
  | x | P  | Var1_mean | Var2_mean | Var3_mean | Var1_sd | Var2_sd | Var3_sd
------------------------------------------------------------------------------
1 | A | P1 | 100       | 50.47     | 298.2     | 2.33    | 0.04    | 8.77
2 | A | P2 | 98        | 18        | 350.33    | 2.32    | 0.04    | 10.3
3 | B | P1 | 100       | 30.93     | 152.73    | 2.33    | 0.04    | 4.49
4 | B | P2 | 100       | 25.33     | 237.67    | 2.33    | 0.04    | 6.99
5 | C | P1 | 99.9      | 25.07     | 184.93    | 2.32    | 0.04    | 5.44
6 | C | P2 | 100       | 18.33     | 132.33    | 2.32    | 0.04    | 3.89
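Each variable has N observations (A, B, C, etc.) for both the reference period P1 and the measurement period P2.
For each observation and each variable, I want to compute the difference between the two periods and divide it by the standard deviation of the reference period.
Using the example above: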
df <- data.frame(x = c("A","A","B","B","C","C"),
                 P = c("P1","P2","P1","P2","P1","P2"),
                 Var1_mean = c(100.0,98,100.0,100.0,99.9,100.0),
                 Var2_mean = c(50.47,18,30.93,25.33,25.07,18.33),
                 Var3_mean = c(298.2,350.33,152.73,237.67,184.93,132.33),
                 Var1_sd = c(2.33,2.32,2.33,2.33,2.32,2.32),
                 Var2_sd = c(0.04,0.04,0.04,0.04,0.04,0.04),
                 Var3_sd = c(8.77,10.3,4.49,6.99,5.44,3.89))
# The "/" sits at the end of the line so that R continues the expression
Z.A.Var1 <- (df$Var1_mean[df$x=="A" & df$P=="P1"] - df$Var1_mean[df$x=="A" & df$P=="P2"]) /
  df$Var1_sd[df$x=="A" & df$P=="P1"]
Z.A.Var2 <- (df$Var2_mean[df$x=="A" & df$P=="P1"] - df$Var2_mean[df$x=="A" & df$P=="P2"]) /
  df$Var2_sd[df$x=="A" & df$P=="P1"]
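and so on.
I could use a "for" loop that scans the observations and variables, but that is cumbersome to write and slow to run.
Can anyone suggest a smarter way to do this, for example with something like dplyr?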
Answer 0 (score: 1)
Try using data.table:
library(data.table)
dat <- data.table(df)
dat[, .(calc_V1 = ((Var1_mean[P == "P1"] - Var1_mean[P == "P2"])/Var1_sd[P == "P1"]),
        calc_V2 = ((Var2_mean[P == "P1"] - Var2_mean[P == "P2"])/Var2_sd[P == "P1"]),
        calc_V3 = ((Var3_mean[P == "P1"] - Var3_mean[P == "P2"])/Var3_sd[P == "P1"])),
    by = .(x)]
   x     calc_V1 calc_V2    calc_V3
1: A  0.85836910  811.75  -5.944128
2: B  0.00000000  140.00 -18.917595
3: C -0.04310345  168.50   9.669118
The above assumes your 200+ variables are in the x column. If instead your real dataset is very wide (two columns for each of the 200+ variables), you will need a different approach.
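For the very wide case, one possible data.table sketch pairs each *_mean column with its *_sd column by name, so the calculation does not have to be typed out per variable. It assumes (this is not stated in the answer) that every variable follows the VarN_mean / VarN_sd naming pattern from the question and that each x has exactly one P1 row and one P2 row. The names mean_cols, sd_cols and res are illustrative.

```r
library(data.table)

# Example data from the question
df <- data.frame(x = c("A","A","B","B","C","C"),
                 P = c("P1","P2","P1","P2","P1","P2"),
                 Var1_mean = c(100.0,98,100.0,100.0,99.9,100.0),
                 Var2_mean = c(50.47,18,30.93,25.33,25.07,18.33),
                 Var3_mean = c(298.2,350.33,152.73,237.67,184.93,132.33),
                 Var1_sd = c(2.33,2.32,2.33,2.33,2.32,2.32),
                 Var2_sd = c(0.04,0.04,0.04,0.04,0.04,0.04),
                 Var3_sd = c(8.77,10.3,4.49,6.99,5.44,3.89))
dat <- as.data.table(df)

# Assumption: columns come in <name>_mean / <name>_sd pairs
mean_cols <- grep("_mean$", names(dat), value = TRUE)
sd_cols   <- sub("_mean$", "_sd", mean_cols)
stopifnot(all(sd_cols %in% names(dat)))

# For each x: (P1 mean - P2 mean) / P1 sd, for every mean/sd pair
res <- dat[, {
  p1 <- P == "P1"
  p2 <- P == "P2"
  z <- Map(function(m, s) (.SD[[m]][p1] - .SD[[m]][p2]) / .SD[[s]][p1],
           mean_cols, sd_cols)
  names(z) <- sub("_mean$", "", mean_cols)  # result columns: Var1, Var2, ...
  z
}, by = x]
```

On the example data, res should hold one row per x with columns Var1, Var2 and Var3 that match the calc_V1 to calc_V3 values shown above.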
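Answer 1 (score: 0)
I always try to avoid for loops. I would put the P1 and P2 variables in two separate tables and the SDs in a third, then use matrix-style arithmetic. Something like this: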
desired.stuff <- (P1.stuff - P2.stuff) / sd.stuff
This solution works for any number of variables, as long as P1 and P2 have the same number of variables.
Here is a solution using data.table syntax:
library(data.table)
df <- data.table(df) # make a data.table
P1.stuff <- df[P=="P1", c(3:5), with=FALSE] # select the P1 means (columns 3 to 5)
P2.stuff <- df[P=="P2", c(3:5), with=FALSE] # select the P2 means
sd.stuff <- df[P=="P1", c(6:8), with=FALSE] # select the P1 SDs
desired.stuff <- (P1.stuff - P2.stuff) / sd.stuff # rows must be in the same x order in all tables
Also, since you seem to be doing a Cohen's d calculation, you may want to look at the effsize package: https://cran.r-project.org/web/packages/effsize/effsize.pdf
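With 200+ variables, hard-coded column positions such as 3:5 and 6:8 become fragile. As a rough sketch of the same three-table idea in base R, the columns can be selected by name suffix and the P2 rows aligned to the P1 rows by x. This assumes the _mean / _sd naming from the question and one P1 and one P2 row per observation. The names mean_cols, sd_cols, p1, p2 and z are illustrative, not part of the answer.

```r
# Example data from the question
df <- data.frame(x = c("A","A","B","B","C","C"),
                 P = c("P1","P2","P1","P2","P1","P2"),
                 Var1_mean = c(100.0,98,100.0,100.0,99.9,100.0),
                 Var2_mean = c(50.47,18,30.93,25.33,25.07,18.33),
                 Var3_mean = c(298.2,350.33,152.73,237.67,184.93,132.33),
                 Var1_sd = c(2.33,2.32,2.33,2.33,2.32,2.32),
                 Var2_sd = c(0.04,0.04,0.04,0.04,0.04,0.04),
                 Var3_sd = c(8.77,10.3,4.49,6.99,5.44,3.89))

# Assumption: every <name>_mean column has a matching <name>_sd column
mean_cols <- grep("_mean$", names(df), value = TRUE)
sd_cols   <- sub("_mean$", "_sd", mean_cols)
stopifnot(all(sd_cols %in% names(df)))

# Split by period and align the P2 rows to the P1 rows by observation
p1 <- df[df$P == "P1", ]
p2 <- df[df$P == "P2", ]
p2 <- p2[match(p1$x, p2$x), ]

# Element-wise matrix arithmetic: (P1 mean - P2 mean) / P1 sd
z <- (as.matrix(p1[mean_cols]) - as.matrix(p2[mean_cols])) /
  as.matrix(p1[sd_cols])
rownames(z) <- as.character(p1$x)
colnames(z) <- sub("_mean$", "", mean_cols)
```

Here z is a numeric matrix with one row per observation and one column per variable, so z["A", "Var1"] is the value the question calls Z.A.Var1.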
Answer 2 (score: 0)
You can use the tidyverse packages to reshape and aggregate the data:
library(dplyr)
library(tidyr)
df.new <- gather(df, variable, value, -x, -P) %>% # convert data to 'long' format
  separate(variable, c('variable', 'measure')) %>% # separate out variable number and measurement (mean and sd) columns
  spread(measure, value) %>% # make each row have a mean and sd column
  group_by(variable, x) %>% # group by variable and observation and...
  summarize(result = diff(mean) / sd[P == 'P1']) # compute the requested summary, storing in 'result'
# Note: diff(mean) is P2 - P1, so the signs are the opposite of the P1 - P2 formula in the question
variable x result
<chr> <fctr> <dbl>
1 Var1 A -0.85836910
2 Var1 B 0.00000000
3 Var1 C 0.04310345
4 Var2 A -811.75000000
5 Var2 B -140.00000000
6 Var2 C -168.50000000
7 Var3 A 5.94412771
8 Var3 B 18.91759465
9 Var3 C -9.66911765
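In current tidyr, gather() and spread() are superseded by pivot_longer() and pivot_wider(). A possible sketch of the same idea with the newer API follows. It assumes tidyr 1.0.0 or later and dplyr 1.0.0 or later, and that each column name contains exactly one underscore separating the variable name from mean or sd. It computes P1 - P2, as in the question, so the signs are flipped relative to the output above. The name res is illustrative.

```r
library(dplyr)
library(tidyr)

# Example data from the question
df <- data.frame(x = c("A","A","B","B","C","C"),
                 P = c("P1","P2","P1","P2","P1","P2"),
                 Var1_mean = c(100.0,98,100.0,100.0,99.9,100.0),
                 Var2_mean = c(50.47,18,30.93,25.33,25.07,18.33),
                 Var3_mean = c(298.2,350.33,152.73,237.67,184.93,132.33),
                 Var1_sd = c(2.33,2.32,2.33,2.33,2.32,2.32),
                 Var2_sd = c(0.04,0.04,0.04,0.04,0.04,0.04),
                 Var3_sd = c(8.77,10.3,4.49,6.99,5.44,3.89))

res <- df %>%
  # "Var1_mean" -> variable = "Var1", value goes into a column named "mean";
  # ".value" tells pivot_longer to use the second name piece as the column name
  pivot_longer(cols = -c(x, P),
               names_to = c("variable", ".value"),
               names_sep = "_") %>%
  group_by(variable, x) %>%
  # (P1 mean - P2 mean) / P1 sd, one row per variable and observation
  summarise(result = (mean[P == "P1"] - mean[P == "P2"]) / sd[P == "P1"],
            .groups = "drop")
```

The result is in long format, with one row per variable and observation, which tends to be easier to filter or plot than 200+ separate result columns.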