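I have a data.table with a set of parameters (amplitude, rate, area and so on, 23 in total) that belong to specific wells (individual experiments, if you like, 48 in total), grouped by treatment (usually about 10 in total), all measured at different time points (there can be many). I want to first take each well and normalize (i.e. divide) all parameters by that well's baseline median, taken over all time points before time "zero". Then I want to take the normalized data and normalize it again, this time by the control treatment group at each time point. I would also like to look at the baseline and control data beforehand and, if necessary, flag and remove outliers before normalizing (although this is not very important right now; I can probably work that out once I understand how to do the normalization).
As an example, I'll create a data.table similar to the one my original instrument-data analysis code produces: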
library(data.table)

dt = data.table(
  wellID = as.factor(c("A4", "B4", "C5", "D5", "A4", "B4", "C5", "D5", "A4",
                       "B4", "C5", "D5")),
  treatment = as.factor(c("Control", "Control", "Drug", "Drug", "Control",
                          "Control", "Drug", "Drug", "Control", "Control", "Drug", "Drug")),
  time_h = c(-0.2, -0.2, -0.2, -0.2, -0.1, -0.1, -0.1, -0.1, 4, 4, 4, 4),
  area = runif(12, min = 0.5, max = 0.9),
  amp = runif(12, min = 0.1, max = 0.2),
  rate = runif(12, min = 33, max = 38)
)
I tried something like this:
baseline = subset(dt, subset = time_h < 0)
to isolate the baseline time points, and then:
base_medians = by(baseline[ , (4:ncol(baseline)) ], baseline$wellID,
  function(x) {
    apply(x, 2, median)
  })
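to get the baseline medians for each well, but I really don't know how to normalize the data in dt so that the wells and parameters match up, and then how to do the second normalization.
I suspect this is not a good strategy. Should I somehow deconstruct and rebuild my dataset?
Any help is appreciated!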
Answer 0 (score: 1)
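This may need some adjustment to the subsetting if it isn't what you're looking for. It divides the parameter columns by their median value where time_h < 0, and then divides them again by their median value where treatment == "Control". As written, both medians are taken over the whole table rather than per well or per time point.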
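set.seed(21) # good practice for questions: set the seed before generating dt so results are reproducible
parm <- c("area", "amp", "rate") # parameters to include
# divide each parameter by its median over the baseline rows (time_h < 0)
dt[, (parm) := lapply(.SD, function(x) x / median(x[time_h < 0])), .SDcols = parm]
# then divide each parameter by its median over the Control rows
dt[, (parm) := lapply(.SD, function(x) x / median(x[treatment == "Control"])), .SDcols = parm]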
    wellID treatment time_h      area       amp      rate
 1:     A4   Control   -0.2 0.9541129 0.7538275 0.9403151
 2:     B4   Control   -0.2 0.7040382 1.1530667 1.0081769
 3:     C5      Drug   -0.2 0.9134096 0.8369863 0.9780808
 4:     D5      Drug   -0.2 0.6721809 0.7392173 1.0067250
 5:     A4   Control   -0.1 1.0354136 1.0865999 0.9978287
 6:     B4   Control   -0.1 1.0162338 0.9134001 0.9918002
 7:     C5      Drug   -0.1 0.6334486 1.0678871 1.0280474
 8:     D5      Drug   -0.1 0.6664317 1.1639014 0.9696164
 9:     A4   Control    4.0 1.0477798 0.7204991 1.0021713
10:     B4   Control    4.0 0.9837662 1.1454020 1.0149003
11:     C5      Drug    4.0 0.8985494 1.2648977 1.0190920
12:     D5      Drug    4.0 1.0239782 1.3705952 0.9268626
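
If the medians are meant to be taken per well and then per time point, as the question describes, one possible adjustment is to add data.table's `by` argument to the same two calls. The following is a minimal sketch under that assumption. It rebuilds the example table so that it runs on its own; the names `well_norm` and `ctrl_norm` are illustrative only, and the numbers it produces will differ from the output printed above.

```r
library(data.table)

set.seed(21)
# Same layout as the example table: 4 wells, 2 treatments, 3 time points.
dt <- data.table(
  wellID    = factor(rep(c("A4", "B4", "C5", "D5"), times = 3)),
  treatment = factor(rep(c("Control", "Control", "Drug", "Drug"), times = 3)),
  time_h    = rep(c(-0.2, -0.1, 4), each = 4),
  area      = runif(12, min = 0.5, max = 0.9),
  amp       = runif(12, min = 0.1, max = 0.2),
  rate      = runif(12, min = 33, max = 38)
)
parm <- c("area", "amp", "rate") # parameters to normalize

# Step 1: within each well, divide every parameter by that well's
# baseline median (rows with time_h < 0). copy() leaves dt untouched.
well_norm <- copy(dt)
well_norm[, (parm) := lapply(.SD, function(x) x / median(x[time_h < 0])),
          by = wellID, .SDcols = parm]

# Step 2: within each time point, divide the well-normalized values by
# the median of the Control rows at that time point.
ctrl_norm <- copy(well_norm)
ctrl_norm[, (parm) := lapply(.SD, function(x) x / median(x[treatment == "Control"])),
          by = time_h, .SDcols = parm]

print(ctrl_norm)
```

If a well has no rows with time_h < 0, or a time point has no Control rows, the corresponding median is NA and that whole group comes out as NA, so such cases would need to be handled separately.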