Question

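I have a multilevel structure, and I need to standardize within each individual (the individual is the higher-level unit, and each individual has several separate measures).

Consider: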

  ID measure score
1  1       1     5
2  1       2     7
3  1       3     3
4  2       1    10
5  2       2     5
6  2       3     3
7  3       1     4
8  3       2     1
9  3       3     1

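I used apply(data, 2, scale) to standardize across all individuals (this also standardizes ID and measure, but that is fine).

But how do I make sure that ID == 1, ID == 2 and ID == 3 are standardized separately? That is, for each ID: (each observation - mean of that ID's 3 scores) divided by the standard deviation of those 3 scores.

I was considering a for loop, but the problem is that I want to bootstrap this (in other words, replicate the whole procedure 1000 times on a large dataset, so speed is very important).

Extra information: IDs can have a variable number of measurements, so not all of them have 3 measured scores.

The dput of the data is: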

structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), measure = c(1L, 
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), score = c(5L, 7L, 3L, 10L, 5L, 
3L, 4L, 1L, 1L)), .Names = c("ID", "measure", "score"), class = "data.frame", row.names = c(NA, 
-9L))

Answer 1

Here is a solution using lapply with split, assuming your data is in DF:

> lapply(split(DF[,-1], DF[,1]), function(x) apply(x, 2, scale))
$`1`
     measure score
[1,]      -1     0
[2,]       0     1
[3,]       1    -1

$`2`
     measure      score
[1,]      -1  1.1094004
[2,]       0 -0.2773501
[3,]       1 -0.8320503

$`3`
     measure      score
[1,]      -1  1.1547005
[2,]       0 -0.5773503
[3,]       1 -0.5773503

An alternative that produces the same result is:

> simplify2array(lapply(split(DF[,-1], DF[,1]), scale))

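This alternative avoids calling apply inside the lapply call.

Here split divides the data into groups defined by ID and returns a list, so you can use lapply to loop over each element of that list and apply scale to it.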
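
To make the split/lapply step concrete, here is a small sketch that standardizes only the score column and puts the result back in the original row order with unsplit. The unbalanced data frame DF2 and the column name z are made up for illustration and are not from the question; the point is only that the same code should work when IDs have different numbers of measurements.

```r
# Hypothetical unbalanced data: ID 1 has three scores, ID 2 has two
DF2 <- data.frame(ID    = c(1L, 1L, 1L, 2L, 2L),
                  score = c(5, 7, 3, 10, 4))

# split() returns a list with one vector of scores per ID
groups <- split(DF2$score, DF2$ID)

# scale() returns a one-column matrix; as.vector() drops it to a plain vector
scaled <- lapply(groups, function(s) as.vector(scale(s)))

# unsplit() reassembles the pieces in the original row order
DF2$z <- unsplit(scaled, DF2$ID)

# Hand check for ID 1: (score - mean) / sd, where sd() uses n - 1
check_id1 <- (c(5, 7, 3) - mean(c(5, 7, 3))) / sd(c(5, 7, 3))
```

Note that an ID with a single measurement would get NA here, because the standard deviation of one value is undefined.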

Using ddply from plyr, as @Roland suggested:

> library(plyr)
> ddply(DF, .(ID), numcolwise(scale))
  ID measure      score
1  1      -1  0.0000000
2  1       0  1.0000000
3  1       1 -1.0000000
4  2      -1  1.1094004
5  2       0 -0.2773501
6  2       1 -0.8320503
7  3      -1  1.1547005
8  3       0 -0.5773503
9  3       1 -0.5773503
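
If only the score column needs to be standardized, base R's ave might be another option worth trying. This is a sketch that is not part of the original answer; the column name score_z is an assumption chosen for illustration, and no speed comparison with the approaches above has been made here.

```r
# Same data as in the question, rebuilt so the block runs on its own
DF <- data.frame(ID      = rep(1:3, each = 3),
                 measure = rep(1:3, times = 3),
                 score   = c(5, 7, 3, 10, 5, 3, 4, 1, 1))

# ave() applies FUN within each ID and returns a vector aligned with the rows
DF$score_z <- ave(DF$score, DF$ID,
                  FUN = function(s) as.vector(scale(s)))

DF
```

Because ave returns a plain vector in the original row order, the result can be assigned straight back as a new column, and groups of different sizes need no special handling.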

Importing your data (this is in response to the last comment):

DF <- read.table(text="  ID measure score
1  1       1     5
2  1       2     7
3  1       3     3
4  2       1    10
5  2       2     5
6  2       3     3
7  3       1     4
8  3       2     1
9  3       3     1", header=TRUE)
