The dataset contains the columns age, gender, state, income, and group.
The group column indicates which group each user belongs to:
group gender state age income
1 3 Female CA 33 $75,000 - $99,999
2 3 Male MA 41 $50,000 - $74,999
3 3 Male KY 32 $35,000 - $49,999
4 2 Female CA 23 $35,000 - $49,999
5 3 Male KY 25 $50,000 - $74,999
6 3 Male MA 21 $75,000 - $99,999
7 3 Female CA 33 $75,000 - $99,999
8 3 Male MA 41 $50,000 - $74,999
9 3 Male KY 32 $35,000 - $49,999
10 2 Female CA 23 $35,000 - $49,999
11 3 Male KY 25 $50,000 - $74,999
12 3 Female MA 21 $75,000 - $99,999
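The above is dummy data; the goal is to get the concept right.
The goal is to group by group, gender, and income, get the counts, and for each group get the mean age of the users who belong to it. The data should then be arranged in the following structure, an "expanded version":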
group male female CA MA KY $35,000 - $49,999 $50,000 - $74,999 $75,000 - $99,999 mean_age
2 0 2 2 0 0 2 1 0 23
...
Here is my attempt, using dplyr:
> data %>% group_by(group,
+ gender,
+ state,
+ income) %>%
+ summarize(n()) %>%
+ mutate(mean_age = mean(age))
I am also exploring the spread function.
Answer 0 (score: 1)
You can compute both the count and the mean in a single call to summarize():
library(dplyr)
data %>% group_by(group,
gender,
state,
income) %>%
summarize(count = n(), mean_age = mean(age))
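To see why the single call matters, here is a minimal sketch on five rows copied from the sample data (rows 3, 4, 5, 9, and 10). The object names counts_only and both are illustrative only. After summarize(), only the grouping columns and the newly created summary columns remain, so a later mutate(mean_age = mean(age)) has no age column to work with.

```r
library(dplyr)

# Five rows taken from the question's sample data (rows 3, 4, 5, 9, 10)
data <- data.frame(
  group  = c(3, 2, 3, 3, 2),
  gender = c("Male", "Female", "Male", "Male", "Female"),
  state  = c("KY", "CA", "KY", "KY", "CA"),
  age    = c(32, 23, 25, 32, 23),
  income = c("$35,000 - $49,999", "$35,000 - $49,999", "$50,000 - $74,999",
             "$35,000 - $49,999", "$35,000 - $49,999"),
  stringsAsFactors = FALSE
)

# Counting first drops every column that is neither a grouping
# variable nor a summary, including age
counts_only <- data %>%
  group_by(group, gender, state, income) %>%
  summarize(n = n())
"age" %in% names(counts_only)

# Computing both summaries in the same summarize() call keeps
# age available while the mean is calculated
both <- data %>%
  group_by(group, gender, state, income) %>%
  summarize(count = n(), mean_age = mean(age)) %>%
  ungroup()
both
```

On these five rows, both has one row per distinct group, gender, state, and income combination, each with its own count and mean_age.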
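For wide data, the variable names in your sample do not uniquely identify what a given data point means, because the unique unit is group X gender X state X income, yet there is only one row per group.
Since you have two summaries, the summary type is an additional layer of unique identification. So, to put everything on one row, you would need variable names like [group]_[gender]_[state]_[income]_[summary]. For example, 2_Female_CA_$35,000 - $49,999_count.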
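As a rough sketch of that naming scheme, the summaries can be stacked into a long format and the identifying columns pasted together with unite() before spreading. The column names summary, value, and key, and the object names long and one_row, are assumptions made for this illustration; the rows are rows 3, 4, 5, 9, and 10 of the sample data.

```r
library(dplyr)
library(tidyr)

# Five rows taken from the question's sample data (rows 3, 4, 5, 9, 10)
data <- data.frame(
  group  = c(3, 2, 3, 3, 2),
  gender = c("Male", "Female", "Male", "Male", "Female"),
  state  = c("KY", "CA", "KY", "KY", "CA"),
  age    = c(32, 23, 25, 32, 23),
  income = c("$35,000 - $49,999", "$35,000 - $49,999", "$50,000 - $74,999",
             "$35,000 - $49,999", "$35,000 - $49,999"),
  stringsAsFactors = FALSE
)

long <- data %>%
  group_by(group, gender, state, income) %>%
  summarize(count = n(), mean_age = mean(age)) %>%
  ungroup() %>%
  # stack the two summaries so the summary type becomes a column
  gather(summary, value, count, mean_age) %>%
  # build [group]_[gender]_[state]_[income]_[summary]
  unite(key, group, gender, state, income, summary)

# one column per key; with no other columns left, everything lands on one row
one_row <- long %>% spread(key, value)
names(one_row)
```

The result has one column per combination and summary type, which shows how quickly the names grow unwieldy.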
There may be a better wide shape. What kind of calculation are you doing on the wide data frame?
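One candidate for such a shape, sketched here under assumptions, is the layout from the question: one row per group, a count column for every level of gender, state, and income, and a mean_age column. The object names counts, mean_age, and wide are illustrative, and the data frame is retyped from the sample data so the block runs on its own. Column order in the result depends on how the level names sort.

```r
library(dplyr)
library(tidyr)

# The question's sample data, retyped so the example is self-contained
df <- data.frame(
  group  = rep(c(3, 3, 3, 2, 3, 3), times = 2),
  gender = c("Female", "Male", "Male", "Female", "Male", "Male",
             "Female", "Male", "Male", "Female", "Male", "Female"),
  state  = rep(c("CA", "MA", "KY", "CA", "KY", "MA"), times = 2),
  age    = rep(c(33, 41, 32, 23, 25, 21), times = 2),
  income = rep(c("$75,000 - $99,999", "$50,000 - $74,999", "$35,000 - $49,999",
                 "$35,000 - $49,999", "$50,000 - $74,999", "$75,000 - $99,999"),
               times = 2),
  stringsAsFactors = FALSE
)

# Mean age of the users in each group
mean_age <- df %>%
  group_by(group) %>%
  summarize(mean_age = mean(age))

# Count every level of gender, state, and income within each group,
# then give each level its own column (0 where a level never occurs)
counts <- df %>%
  select(group, gender, state, income) %>%
  gather(variable, level, gender, state, income) %>%
  count(group, level) %>%
  spread(level, n, fill = 0)

wide <- left_join(counts, mean_age, by = "group")
wide
```

With the sample data this gives two rows, one for group 2 and one for group 3. Note that each variable is counted independently here, so the result no longer records how gender, state, and income combine.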
Answer 1 (score: 1)
In addition to @treysp's answer, you could also use unite and spread to create a wide (and impractical) table. (I only use as.data.frame() to force all columns to print.)
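library(tidyverse)  # loads dplyr and tidyr; df is defined with the sample data further down
df %>%
group_by(group, gender, state, income) %>%
summarize(n = n(), mean_age = mean(age)) %>%
unite(key, gender, state, income) %>%  # combine the three variables into a single key
spread(key, n) %>%                     # one column per key, filled with the counts
as.data.frame()                        # only to force all columns to print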
# group mean_age Female_CA_$35,000 - $49,999 Female_CA_$75,000 - $99,999
#1 2 23 2 NA
#2 3 21 NA NA
#3 3 25 NA NA
#4 3 32 NA NA
#5 3 33 NA 2
#6 3 41 NA NA
# Female_MA_$75,000 - $99,999 Male_KY_$35,000 - $49,999
#1 NA NA
#2 1 NA
#3 NA NA
#4 NA 2
#5 NA NA
#6 NA NA
# Male_KY_$50,000 - $74,999 Male_MA_$50,000 - $74,999 Male_MA_$75,000 - $99,999
#1 NA NA NA
#2 NA NA 1
#3 2 NA NA
#4 NA NA NA
#5 NA NA NA
#6 NA 2 NA
# Sample data
df <- read.table(text =
"group gender state age income
1 3 Female CA 33 '$75,000 - $99,999'
2 3 Male MA 41 '$50,000 - $74,999'
3 3 Male KY 32 '$35,000 - $49,999'
4 2 Female CA 23 '$35,000 - $49,999'
5 3 Male KY 25 '$50,000 - $74,999'
6 3 Male MA 21 '$75,000 - $99,999'
7 3 Female CA 33 '$75,000 - $99,999'
8 3 Male MA 41 '$50,000 - $74,999'
9 3 Male KY 32 '$35,000 - $49,999'
10 2 Female CA 23 '$35,000 - $49,999'
11 3 Male KY 25 '$50,000 - $74,999'
12 3 Female MA 21 '$75,000 - $99,999'", header = TRUE, row.names = 1)