I have the following data frame containing data for 1,000 children: sex, three repeated height measurements, and the age at each measurement.
data <- data.frame(
  child_id = 1:1000,
  sex = rbinom(n = 1000, size = 1, prob = 0.5),
  height_5 = rnorm(1000, mean = 80, sd = 5),
  height_6 = rnorm(1000, mean = 90, sd = 5),
  height_7 = rnorm(1000, mean = 100, sd = 5),
  age_5 = rnorm(1000, mean = 5.2, sd = 1.5),
  age_6 = rnorm(1000, mean = 6.1, sd = 1.5),
  age_7 = rnorm(1000, mean = 7.3, sd = 1.5)
)
data$sex <- factor(data$sex,
                   levels = c(0, 1),
                   labels = c("Male", "Female"))
### Generate SOME MISSING VALUES -----
data$height_5[which(data$height_5 %in% sample(data$height_5, 25))] <- NA
data$height_6[which(data$height_6 %in% sample(data$height_6, 25))] <- NA
data$height_7[which(data$height_7 %in% sample(data$height_7, 25))] <- NA
I can generate z-scores at each measurement occasion as follows:
data$ht5z <- scale(data$height_5, center = TRUE, scale = TRUE)
data$ht6z <- scale(data$height_6, center = TRUE, scale = TRUE)
data$ht7z <- scale(data$height_7, center = TRUE, scale = TRUE)
How can I generate these for each sex and year of age, e.g. htzm3 if sex = male and age >= 3 and < 4, htzm4 if sex = male and age >= 4 and < 5, and so on?
Answer 0 (score: 1)
How about this?
library(dplyr)
library(stringr)
library(tidyr)
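data %>%
  # stack the six age/height columns into key/value pairs
  gather(key, value, age_5, age_6, age_7, height_5, height_6, height_7) %>%
  # split e.g. "height_5" into key = "height" and obs_time = "5"
  separate(key, c("key", "obs_time"), "_") %>%
  # one row per child and observation time, with age and height columns
  spread(key, value) %>%
  mutate(whole_age = floor(age)) %>%
  group_by(sex, whole_age) %>%
  mutate(htz = as.numeric(scale(height)),  # scale() returns a matrix; keep a plain vector
         sex_init = str_to_lower(str_extract(sex, "^.")),
         sa = paste0("htz", sex_init, whole_age)) %>%
  ungroup() %>%
  spread(sa, htz)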
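First, we want to get the data into tidy form.
To do this, we start by gathering all of your age and height columns into two columns: key and value. The key column takes the names of the original variables as its values, and the value column takes the values found under the corresponding variable. The other variables are copied over as they are. The data now looks like this: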
# A tibble: 6,000 x 4
child_id sex key value
<int> <fct> <chr> <dbl>
1 1 Male age_5 5.67
2 1 Male age_6 7.02
3 1 Male age_7 8.86
4 1 Male height_5 79.2
5 1 Male height_6 95.8
6 1 Male height_7 85.0
7 2 Male age_5 3.38
8 2 Male age_6 5.06
9 2 Male age_7 5.47
10 2 Male height_5 79.2
# ... with 5,990 more rows
Second, we separate the key column into two columns, key and obs_time, using "_" as the separator. The data now looks like this:
# A tibble: 6,000 x 5
child_id sex key obs_time value
<int> <fct> <chr> <chr> <dbl>
1 1 Male age 5 5.67
2 1 Male age 6 7.02
3 1 Male age 7 8.86
4 1 Male height 5 79.2
5 1 Male height 6 95.8
6 1 Male height 7 85.0
7 2 Male age 5 3.38
8 2 Male age 6 5.06
9 2 Male age 7 5.47
10 2 Male height 5 79.2
# ... with 5,990 more rows
Third, we spread value into two variables: age and height. The data now looks like this:
# A tibble: 3,000 x 5
child_id sex obs_time age height
<int> <fct> <chr> <dbl> <dbl>
1 1 Male 5 5.67 79.2
2 1 Male 6 7.02 95.8
3 1 Male 7 8.86 85.0
4 2 Male 5 3.38 79.2
5 2 Male 6 5.06 81.8
6 2 Male 7 5.47 102.
7 3 Male 5 5.04 80.4
8 3 Male 6 6.37 95.3
9 3 Male 7 7.01 97.4
10 4 Male 5 6.25 90.8
# ... with 2,990 more rows
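As a side note, the first three steps can probably be collapsed into a single call in tidyr 1.0.0 or later. The sketch below is not part of the original pipeline: it uses pivot_longer() with the special ".value" name on a small made-up data frame (toy, with invented values) that follows the same column naming scheme as your data.

```r
library(tidyr)

# Toy data with the same naming scheme as the question (values are made up).
toy <- data.frame(
  child_id = 1:2,
  sex      = factor(c("Male", "Female")),
  height_5 = c(79.2, 81.0),
  height_6 = c(95.8, 88.4),
  age_5    = c(5.67, 4.90),
  age_6    = c(7.02, 6.30)
)

# ".value" means the first part of each column name (age / height) becomes
# an output column; the second part (5 / 6) goes into obs_time.
long <- pivot_longer(
  toy,
  cols      = matches("^(age|height)_"),
  names_to  = c(".value", "obs_time"),
  names_sep = "_"
)

long
```

The result has one row per child and observation time with age and height side by side, which is the same shape as the table above.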
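Fourth through seventh, we create the age category whole_age with mutate, then group by sex and whole_age so that the scaling is applied separately to each group. We then scale within each group, extract the first letter of sex as an initial, and build, in a column called sa, the name of the variable that each newly scaled value belongs to. After that we can drop the grouping. The data now looks like this: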
# A tibble: 3,000 x 9
child_id sex obs_time age height whole_age htz sex_init sa
<int> <fct> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 1 Male 5 5.67 79.2 5 -0.967 m htzm5
2 1 Male 6 7.02 95.8 7 0.345 m htzm7
3 1 Male 7 8.86 85.0 8 -1.20 m htzm8
4 2 Male 5 3.38 79.2 3 -0.580 m htzm3
5 2 Male 6 5.06 81.8 5 -0.681 m htzm5
6 2 Male 7 5.47 102. 5 1.55 m htzm5
7 3 Male 5 5.04 80.4 5 -0.829 m htzm5
8 3 Male 6 6.37 95.3 6 0.455 m htzm6
9 3 Male 7 7.01 97.4 7 0.529 m htzm7
10 4 Male 5 6.25 90.8 6 -0.0366 m htzm6
# ... with 2,990 more rows
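To make the scaling step concrete, here is a small base R sketch with made-up numbers. It checks that scale() gives the usual (x - mean) / sd z-score with missing values ignored, and shows that scale() returns a one-column matrix rather than a plain vector. The grouped part uses ave() as a stand-in for group_by() plus mutate(); the vectors x, g and h are hypothetical.

```r
x <- c(79.2, 95.8, NA, 85.0)

z <- scale(x)  # one-column matrix carrying centering/scaling attributes
by_hand <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

is.matrix(z)                       # scale() returns a matrix
all.equal(as.numeric(z), by_hand)  # same values once the matrix is dropped

# Group-wise version in base R: ave() applies FUN within each level of g.
g <- c("m5", "m5", "m5", "f5", "f5", "f5")
h <- c(79.2, 81.8, 102.0, 80.4, 90.8, NA)
z_by_group <- ave(h, g, FUN = function(v) as.numeric(scale(v)))
z_by_group
```

Each group is centered on its own mean, and a missing height simply stays NA in the scaled column.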
Finally, we can spread the data into the variables you asked for. We now have:
# A tibble: 3,000 x 32
child_id sex obs_time age height whole_age sex_init htzf0 htzf1 htzf10 htzf11 htzf2 htzf3
<int> <fct> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 Male 5 5.67 79.2 5 m NA NA NA NA NA NA
2 1 Male 6 7.02 95.8 7 m NA NA NA NA NA NA
3 1 Male 7 8.86 85.0 8 m NA NA NA NA NA NA
4 2 Male 5 3.38 79.2 3 m NA NA NA NA NA NA
5 2 Male 6 5.06 81.8 5 m NA NA NA NA NA NA
6 2 Male 7 5.47 102. 5 m NA NA NA NA NA NA
7 3 Male 5 5.04 80.4 5 m NA NA NA NA NA NA
8 3 Male 6 6.37 95.3 6 m NA NA NA NA NA NA
9 3 Male 7 7.01 97.4 7 m NA NA NA NA NA NA
10 4 Male 5 6.25 90.8 6 m NA NA NA NA NA NA
# ... with 2,990 more rows, and 19 more variables: htzf4 <dbl>, htzf5 <dbl>, htzf6 <dbl>,
# htzf7 <dbl>, htzf8 <dbl>, htzf9 <dbl>, htzm0 <dbl>, htzm1 <dbl>, htzm10 <dbl>, htzm11 <dbl>,
# htzm12 <dbl>, htzm2 <dbl>, htzm3 <dbl>, htzm4 <dbl>, htzm5 <dbl>, htzm6 <dbl>, htzm7 <dbl>,
# htzm8 <dbl>, htzm9 <dbl>
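If you are on a recent tidyr, the last step could likely be written with pivot_wider() instead. This is only a sketch on a few made-up rows (the data frame scaled and its values are invented) in the shape produced by the earlier steps.

```r
library(tidyr)

# Made-up long-format rows: one scaled height per child and observation time.
scaled <- data.frame(
  child_id = c(1, 1, 2, 2),
  obs_time = c("5", "6", "5", "6"),
  htz      = c(-0.97, 0.35, -0.58, 1.55),
  sa       = c("htzm5", "htzm7", "htzf3", "htzf5")
)

# Each distinct value of sa becomes its own column, filled from htz.
wide <- pivot_wider(scaled, names_from = sa, values_from = htz)
wide
```

As in the output above, every row ends up with exactly one non-missing htz* column, namely the one matching that child's sex and whole age at that observation, and NA everywhere else.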