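I have some grouped data whose features have very different ranges. I want to standardise each feature by group. In addition, I want to do this for an arbitrarily large selection of features (given by name, e.g. standardise.vars below). What is the best way to do this in R?
My approach so far is shown in the following silly example: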
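library(data.table)
DT <- data.table(mtcars)
group.vars <- c('cyl', 'am')
setkeyv(DT, group.vars)
standardise.vars <- c('disp','hp')
# per-group means and standard deviations of the selected columns
mns <- DT[, lapply(.SD, mean), .SDcols = standardise.vars, by = group.vars]
sds <- DT[, lapply(.SD, sd), .SDcols = standardise.vars, by = group.vars]
# merge on the grouping columns only, so disp/hp receive the suffixes
merged <- merge(mns, sds, by = group.vars, suffixes = c('.mean', '.sd'))
DT[merged, ]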
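This gives me the columns to standardise with their group means and standard deviations alongside them. I now need to perform the operation (x - x.mean) / x.sd for every column x.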
    mpg cyl  disp hp drat    wt  qsec vs am gear carb disp.mean  hp.mean   disp.sd    hp.sd
1: 24.4   4 146.7 62 3.69 3.190 20.00  1  0    4    2  135.8667 84.66667 13.969371 19.65536
2: 22.8   4 140.8 95 3.92 3.150 22.90  1  0    4    2  135.8667 84.66667 13.969371 19.65536
3: 21.5   4 120.1 97 3.70 2.465 20.01  1  0    3    1  135.8667 84.66667 13.969371 19.65536
...
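However, I feel this is a poor way of doing it, and that the standardisation could be done in one step. Any help, pointers on how I am misusing data.table, or suggestions for using dplyr would be much appreciated.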
This approach using scale is close, but the output is not in a suitable format (removing the list(...) from around scale causes an error):
DT[, list(disp.scaled = list(scale(disp)),
hp.scaled = list(scale(hp))), by = .(cyl,am)]
cyl am disp.scaled
1: 4 0 0.7755062, 0.3531536,-1.1286597
2: 4 1 0.7026252,-0.7282640,-0.8747715,-1.0994162,-0.7136133, 1.3033057,
3: 6 0 1.1946100, 0.4570585,-0.8258343,-0.8258343
4: 6 1 0.5773503, 0.5773503,-1.1547005
5: 8 0 0.0331832, 0.0331832,-1.1391352,-1.1391352,-1.1391352, 1.5925615,
6: 8 1 0.7071068,-0.7071068
hp.scaled
1: -1.1532051, 0.5257259, 0.6274793
2: 0.4910526,-0.7007155,-1.3186693,-0.7448550,-0.7007155, 0.4027735,
3: -0.5719714,-1.1167062, 0.8443388, 0.8443388
4: -0.5773503,-0.5773503, 1.1547005
5: -0.5745432, 1.5237884,-0.4246623,-0.4246623,-0.4246623, 0.3247418,
6: -0.7071068, 0.7071068
This approach using dplyr is very close, but something odd happens with group_by_ (it works with group_by):
ans <- DT %>% group_by(cyl, am) %>%
mutate_each_(funs(scale), standardise.vars)
ans2 <- DT %>% group_by_(group.vars) %>%
mutate_each_(funs(scale), standardise.vars)
truth <- filter(DT,am==0,cyl==4) %>%
transmute((disp - mean(disp))/sd(disp))
cbind(DT[,.(cyl,am,disp)], ans[,disp], ans2[,disp], truth)[1:3]
cyl am disp V2 V3 (disp - mean(disp))/sd(disp)
1: 4 0 146.7 0.7755062 1.546750 0.7755062
2: 4 0 140.8 0.3531536 1.327187 0.3531536
3: 4 0 120.1 -1.1286597 0.556857 -1.1286597
Answer 0 (score: 3)
Suppose we want to standardise the variables named in standardise.vars within the groups defined by the variables in group.vars:
DT <- data.table(mtcars)
group.vars <- c('cyl', 'am')
standardise.vars <- c('disp','hp')
I think this dplyr solution does it:
DT <- DT %>% group_by_(.dots=group.vars) %>%
mutate_each_(funs(scale), standardise.vars)
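The underscore verbs and funs() used above were deprecated in later dplyr releases. As a rough sketch of the same grouped standardisation, assuming dplyr 1.0.0 or newer (where across() and all_of() are available), something like the following should work. The result name scaled is only an illustrative choice, and as.numeric() is there because scale() returns a one-column matrix rather than a plain vector.

```r
library(dplyr)

group.vars <- c("cyl", "am")
standardise.vars <- c("disp", "hp")

# Assumes dplyr >= 1.0.0. `scaled` is a hypothetical name for the result.
scaled <- mtcars %>%
  # group by the columns whose names are stored in group.vars
  group_by(across(all_of(group.vars))) %>%
  # standardise each selected column within its group;
  # as.numeric() drops the matrix structure that scale() returns
  mutate(across(all_of(standardise.vars), ~ as.numeric(scale(.x)))) %>%
  ungroup()
```

For the cyl == 4, am == 0 group, the disp values produced this way should agree with the hand-computed (disp - mean(disp)) / sd(disp) column shown in the question.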
For completeness, you can do it with data.table like this:
myscale <- function(x){
(x - mean(x)) / sd(x)
}
# parentheses make := assign to the columns named in standardise.vars
DT[, (standardise.vars) := lapply(.SD, myscale), .SDcols = standardise.vars,
   by = group.vars]
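If you would rather keep the original columns and reuse scale() from the question, here is a minimal sketch of one way that might look. The .scaled suffix and the scaled.vars name are assumptions made for illustration; as.vector() strips the matrix structure that scale() returns, which is why the bare scale() call in the question needed wrapping in list().

```r
library(data.table)

DT <- data.table(mtcars)
group.vars <- c("cyl", "am")
standardise.vars <- c("disp", "hp")

# Hypothetical names for the new columns: disp.scaled, hp.scaled
scaled.vars <- paste0(standardise.vars, ".scaled")

# Add the standardised columns by reference, one group at a time;
# as.vector() turns the one-column matrix from scale() into a plain vector
DT[, (scaled.vars) := lapply(.SD, function(x) as.vector(scale(x))),
   .SDcols = standardise.vars, by = group.vars]

# Spot check a single group against the formula from the question
DT[cyl == 4 & am == 0,
   .(disp, disp.scaled, manual = (disp - mean(disp)) / sd(disp))]
```

The last call should show disp.scaled and manual agreeing for the three rows in that group.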