dplyr: group means for long-format data

Time: 2015-07-20 14:23:16

Tags: r dplyr mean

I cannot figure out how to compute a simple mean with dplyr on long-format data.

My data look like this:

   hldid   idno sex diary age
1   1294 1294_1   2     1  39
2   1294 1294_1   2     2  39
3   1294 1294_2   1     1  43
4   1294 1294_2   1     2  43
...

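There are 5 variables: hldid, idno, sex, diary, and age. idno is the person identifier, but it is not a unique key.

Each person appears twice, once for each diary.

What I want is simply the mean age by sex.

Can you help me?

I tried something like this: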

 dta %>% 
   group_by(sex) %>%
   mutate( ng = n_distinct(idno)) %>%
   group_by(age, add=TRUE) %>%
   summarise(mean = n()/ng[1] )

But it does not work.

Data:

# Plain data.frame; the original dput() output also carried grouped_df
# attributes (grouped by hldid), which are not needed for the code in this post.
dta = data.frame(hldid = c(1294, 1294, 1294, 1294, 1352, 1352, 
1352, 1352, 3741, 3741, 3741, 3741, 3809, 3809, 3809, 3809, 4037, 
4037, 4037, 4037), idno = c("1294_1", "1294_1", "1294_2", "1294_2", 
"1352_1", "1352_1", "1352_2", "1352_2", "3741_1", "3741_1", "3741_2", 
"3741_2", "3809_1", "3809_1", "3809_2", "3809_2", "4037_1", "4037_1", 
"4037_2", "4037_2"), sex = c(2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 
2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L), diary = c(1L, 
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 
2L, 1L, 2L), age = c(39L, 39L, 43L, 43L, 31L, 31L, 37L, 37L, 
33L, 33L, 37L, 37L, 34L, 34L, 37L, 37L, 41L, 41L, 32L, 32L), 
stringsAsFactors = FALSE)

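Quick update

Maybe this does not apply to this example, but the kind of problem I have in mind is the following:

Imagine we have data like this: 3 women and 2 men, and a dummy variable act.

If we compute the mean without accounting for the long format, we get the mean per row rather than the mean per person: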

aggregate(act ~ sex, FUN = mean, data = dtaTime)

What we should do instead is sum act by sex and divide by the number of persons:

aggregate(act ~ sex, FUN = sum, data = dtaTime)
6 / 2 # men 
10 / 3 # women 
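
The same per-person figure can probably be obtained in one step with dplyr. This is only a minimal sketch, assuming that id uniquely identifies a person in dtaTime; the column names total, persons, and per_person are illustrative and not part of the original post. The data are rebuilt compactly so the block runs on its own.

```r
library(dplyr)

# Same values as dtaTime in this post, rebuilt compactly (5 rows per person)
dtaTime <- data.frame(
  id  = rep(1:5, each = 5),
  sex = rep(c(1L, 2L, 1L, 2L, 2L), each = 5),
  act = c(1L, 1L, 0L, 1L, 0L,
          1L, 1L, 1L, 0L, 0L,
          0L, 0L, 1L, 1L, 1L,
          0L, 1L, 1L, 1L, 1L,
          1L, 1L, 0L, 0L, 1L)
)

res <- dtaTime %>%
  group_by(sex) %>%
  summarise(
    total      = sum(act),         # sum of act over all rows of that sex
    persons    = n_distinct(id),   # number of distinct persons, not rows
    per_person = total / persons   # assumed target: mean of act per person
  )

print(res)
```

Dividing by n_distinct(id) rather than by the row count is what distinguishes this from mean(act), which divides by the number of rows.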

Data

dtaTime = structure(list(id = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 
3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L), 
sex = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), act = c(1L, 
1L, 0L, 1L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 
1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L)), .Names = c("id", "sex", 
"act"), class = "data.frame", row.names = c(NA, -25L))

1 answer:

Answer 0 (score: 6)

You are overcomplicating it.

dta %>% 
   group_by(sex) %>% 
   summarise(meanage = mean(age))

This should give you the mean age by sex.

Base R alternative:

aggregate(age ~ sex, dta, mean)

data.table alternative:

library(data.table)
setDT(dta)[, .(meanage = mean(age)), by = sex]
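
In dta every person contributes exactly two rows, so the row-level mean and the person-level mean coincide. If the number of diaries differed between persons, one option would be to keep one row per person before averaging. The sketch below uses a hypothetical unbalanced data frame named toy (not from the question) and assumes that age and sex are constant within idno.

```r
library(dplyr)

# Hypothetical unbalanced data: person "a" has 3 diary rows, person "b" has 1
toy <- data.frame(
  idno = c("a", "a", "a", "b"),
  sex  = c(1L, 1L, 1L, 1L),
  age  = c(40L, 40L, 40L, 20L),
  stringsAsFactors = FALSE
)

# Row-level mean: person "a" is counted three times
by_row <- toy %>%
  group_by(sex) %>%
  summarise(meanage = mean(age))

# Person-level mean: keep the first row of each person, then average
by_person <- toy %>%
  distinct(idno, .keep_all = TRUE) %>%
  group_by(sex) %>%
  summarise(meanage = mean(age))

print(by_row)
print(by_person)
```

With balanced data such as dta the two approaches should agree, so the extra distinct() step only matters when persons have unequal numbers of rows.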