我正在尝试模拟三个小数据集,其中包含x1
,x2
,x3
,x4
,trt
和IND
。
但是,当我尝试使用R中的“拆分”来分割模拟数据时,我得到警告消息和输出是正确的。有人可以给我一个暗示我在R代码中做错了什么吗?
# Step 2: simulate data
Alpha = 0.05
S = 3 # number of replicates
x = 8 # number of covariates
G = 3 # number of treatment groups
N = 50 # number of subjects per dataset
tot = S*N # total subjects for a simulation run
# True parameters
alpha = c(0.5, 0.8) # intercepts
b1 = c(0.1,0.2,0.3,0.4) # for pi_1 of trt A
b2 = c(0.15,0.25,0.35,0.45) # for pi_2 of trt B
b = c(1.1,1.2,1.3,1.4);
##############################################################################
# Scenario 1: all covariates are independent standard normally distributed #
##############################################################################
set.seed(12)
x1 = rnorm(n=tot, mean=0, sd=1);x2 = rnorm(n=tot, mean=0, sd=1);
x3 = rnorm(n=tot, mean=0, sd=1);x4 = rnorm(n=tot, mean=0, sd=1);
###############################################################################
p1 = exp(alpha[1]+b1[1]*x1+b1[2]*x2+b1[3]*x3+b1[4]*x4)/
(1+exp(alpha[1]+b1[1]*x1+b1[2]*x2+b1[3]*x3+b1[4]*x4) +
exp(alpha[2]+b2[1]*x1+b2[2]*x2+b2[3]*x3+b2[4]*x4))
p2 = exp(alpha[2]+b2[1]*x1+b2[2]*x2+b2[3]*x3+b2[4]*x4)/
(1+exp(alpha[1]+b1[1]*x1+b1[2]*x2+b1[3]*x3+b1[4]*x4) +
exp(alpha[2]+b2[1]*x1+b2[2]*x2+b2[3]*x3+b2[4]*x4))
p3 = 1/(1+exp(alpha[1]+b1[1]*x1+b1[2]*x2+b1[3]*x3+b1[4]*x4) +
exp(alpha[2]+b2[1]*x1+b2[2]*x2+b2[3]*x3+b2[4]*x4))
# To assign subjects to one of treatment groups based on response probabilities
tmp = function(x){sample(c("A","B","C"), 1, prob=x, replace=TRUE)}
trt = apply(cbind(p1,p2,p3),1,tmp)
IND=rep(1:S,each=N) #create an indicator for split simulated data
sim=data.frame(x1,x2,x3,x4,trt, IND)
Aset = subset(sim, trt=="A")
Bset = subset(sim, trt=="B")
Cset = subset(sim, trt=="C")
Anew = split(Aset, f = IND)
Bnew = split(Bset, f = IND)
Cnew = split(Cset, f = IND)
警告信息:
> Anew = split(Aset, f = IND)
Warning message:
In split.default(x = seq_len(nrow(x)), f = f, drop = drop, ...) :
data length is not a multiple of split variable
,输出变为
$`2`
x1 x2 x3 x4 trt IND
141 1.0894068 0.09765185 -0.46702047 0.4049424 A 3
145 -1.2953113 -1.94291045 0.09926239 -0.5338715 A 3
148 0.0274979 0.72971804 0.47194731 -0.1963896 A 3
$`3`
[1] x1 x2 x3 x4 trt IND
<0 rows> (or 0-length row.names)
我已经多次检查了我的R代码,但我无法弄清楚我做错了什么。非常感谢提前
答案 0 :(得分:5)
IND
是完整数据的全局变量sim
。您希望将特定的一个用于子集,例如
Anew <- split(Aset, f = Aset$IND)
答案 1 :(得分:4)
这是一个警告,而不是错误,这意味着split
已成功执行,但可能没有完成您想要做的事情。
从帮助文件的“详细信息”部分:
f is recycled as necessary and if the length of x is not a multiple of the length of f a warning is printed. Any missing values in f are dropped together with the corresponding values of x.
请尝试根据数据框的大小检查IND
的长度。
答案 2 :(得分:1)
一旦你的数据被拆分,不确定你的目标是什么,但这听起来像是plyr包的一个很好的候选者。
> library(plyr)
> ddply(sim, .(trt,IND), summarise, x1mean=mean(x1), x2sum=sum(x2), x3min=min(x3), x4max=max(x4))
trt IND x1mean x2sum x3min x4max
1 A 1 -0.49356448 -1.5650528 -1.016615 2.0027822
2 A 2 0.05908053 5.1680463 -1.514854 0.8184445
3 A 3 0.22898716 1.8584443 -1.934188 1.6326763
4 B 1 0.01531230 1.1005720 -2.002830 2.6674931
5 B 2 0.17875088 0.2526760 -1.546043 1.2021935
6 B 3 0.13398967 -4.8739380 -1.565945 1.7887837
7 C 1 -0.16993037 -0.5445507 -1.954848 0.6222546
8 C 2 -0.04581149 -6.3230167 -1.491114 0.8714535
9 C 3 -0.41610973 0.9085831 -1.797661 2.1174894
>
您可以将summarise
及其后续参数替换为任何返回data.frame
的函数或可以强制转换为某个函数的函数。如果列表是目标,ldply
就是您的朋友。