Grouping by different columns

Date: 2018-02-12 21:38:15

Tags: r group-by dplyr

I have a dataset with the columns age, gender, state, income, and group. The group column indicates which group each user belongs to:

     group      gender state age       income
 1       3      Female  CA     33  $75,000 - $99,999
 2       3        Male  MA     41  $50,000 - $74,999
 3       3        Male  KY     32  $35,000 - $49,999
 4       2      Female  CA     23  $35,000 - $49,999
 5       3        Male  KY     25  $50,000 - $74,999
 6       3        Male  MA     21  $75,000 - $99,999
 7       3      Female  CA     33  $75,000 - $99,999
 8       3        Male  MA     41  $50,000 - $74,999
 9       3        Male  KY     32  $35,000 - $49,999
10       2      Female  CA     23  $35,000 - $49,999
11       3        Male  KY     25  $50,000 - $74,999
12       3      Female  MA     21  $75,000 - $99,999

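The above is dummy data; the goal is just to get the concept right.

I want to group by group, gender, and income and get the counts, and for each group also get the mean age of the users belonging to it. I then want to lay the data out in the following structure, an "expanded" (wide) version: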

    group  male female CA  MA  KY  $35,000 - $49,999  $50,000 - $74,999 $75,000 - $99,999  mean_age
     2      0     2     2   0   0          2                1              0                   23
...

Here is my attempt, using dplyr:

> data %>% group_by(group, 
+ gender, 
+ state, 
+ income) %>% 
+ summarize(n()) %>% 
+ mutate(mean_age = mean(age))

I am also exploring the spread function.

2 answers:

Answer 0: (score: 1)

You can compute both the count and the mean in a single call to summarize():

library(dplyr)    

data %>% group_by(group, 
                  gender, 
                  state, 
                  income) %>% 
  summarize(count = n(), mean_age = mean(age))

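For the wide data, the variable names in your sample do not uniquely identify what a given data point means, because the unique unit is group X gender X state X income, yet you want only one row per group.

Because you have two summaries, the summary type is an additional layer of unique identification. So to put everything in one row, you would need variable names like [group]_[gender]_[state]_[income]_[summary]. For example, 2_Female_CA_$35,000 - $49,999_count

There may be a better wide shape. What kind of calculations are you doing on the wide data frame?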
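One candidate for such a shape is the layout sketched in the question: one row per group, with a separate count column for each level of gender, state, and income, plus the group's mean age. The following is only a sketch under assumptions, not necessarily what the asker needs: the data frame is rebuilt here as `data` from the dummy rows in the question, the names `mean_ages`, `wide_counts`, and `result` are made up for illustration, and it assumes tidyr 1.0.0 or later for pivot_longer() and pivot_wider().

```r
library(dplyr)
library(tidyr)

# Dummy data copied from the question (assumed to match the asker's `data`)
data <- data.frame(
  group  = c(3, 3, 3, 2, 3, 3, 3, 3, 3, 2, 3, 3),
  gender = c("Female", "Male", "Male", "Female", "Male", "Male",
             "Female", "Male", "Male", "Female", "Male", "Female"),
  state  = c("CA", "MA", "KY", "CA", "KY", "MA",
             "CA", "MA", "KY", "CA", "KY", "MA"),
  age    = c(33, 41, 32, 23, 25, 21, 33, 41, 32, 23, 25, 21),
  income = c("$75,000 - $99,999", "$50,000 - $74,999", "$35,000 - $49,999",
             "$35,000 - $49,999", "$50,000 - $74,999", "$75,000 - $99,999",
             "$75,000 - $99,999", "$50,000 - $74,999", "$35,000 - $49,999",
             "$35,000 - $49,999", "$50,000 - $74,999", "$75,000 - $99,999"),
  stringsAsFactors = FALSE
)

# Mean age of the users in each group
mean_ages <- data %>%
  group_by(group) %>%
  summarize(mean_age = mean(age))

# Stack gender, state, and income into one column of levels,
# count each level within each group, then spread the levels into columns
wide_counts <- data %>%
  select(group, gender, state, income) %>%
  pivot_longer(-group, names_to = "variable", values_to = "level") %>%
  count(group, level) %>%
  pivot_wider(names_from = level, values_from = n,
              values_fill = list(n = 0))

# One row per group: level counts plus mean age
result <- left_join(wide_counts, mean_ages, by = "group")
```

Each column here is a marginal count (for example, how many users in the group are Female), so the combinations of gender, state, and income are no longer recoverable. The sketch also assumes that no level name is shared between two variables; if one were, the two counts would be summed into the same column.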

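Answer 1: (score: 1)

In addition to @treysp's answer, you can also use unite and spread to create a wide (and not very practical) table. (I only use as.data.frame() to force all columns to be printed.)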

require(tidyverse);
df %>%
    group_by(group, gender, state, income) %>%
    summarize(n = n(), mean_age = mean(age)) %>%
    unite(key, gender, state, income) %>%
    spread(key, n) %>% as.data.frame();
#  group mean_age Female_CA_$35,000 - $49,999 Female_CA_$75,000 - $99,999
#1     2       23                           2                          NA
#2     3       21                          NA                          NA
#3     3       25                          NA                          NA
#4     3       32                          NA                          NA
#5     3       33                          NA                           2
#6     3       41                          NA                          NA
#  Female_MA_$75,000 - $99,999 Male_KY_$35,000 - $49,999
#1                          NA                        NA
#2                           1                        NA
#3                          NA                        NA
#4                          NA                         2
#5                          NA                        NA
#6                          NA                        NA
#  Male_KY_$50,000 - $74,999 Male_MA_$50,000 - $74,999 Male_MA_$75,000 - $99,999
#1                        NA                        NA                        NA
#2                        NA                        NA                         1
#3                         2                        NA                        NA
#4                        NA                        NA                        NA
#5                        NA                        NA                        NA
#6                        NA                         2                        NA
#

Sample data

df <- read.table(text =
    "group      gender state age       income
 1       3      Female  CA     33  '$75,000 - $99,999'
 2       3        Male  MA     41  '$50,000 - $74,999'
 3       3        Male  KY     32  '$35,000 - $49,999'
 4       2      Female  CA     23  '$35,000 - $49,999'
 5       3        Male  KY     25  '$50,000 - $74,999'
 6       3        Male  MA     21  '$75,000 - $99,999'
 7       3      Female  CA     33  '$75,000 - $99,999'
 8       3        Male  MA     41  '$50,000 - $74,999'
 9       3        Male  KY     32  '$35,000 - $49,999'
10       2      Female  CA     23  '$35,000 - $49,999'
11       3        Male  KY     25  '$50,000 - $74,999'
12       3      Female  MA     21  '$75,000 - $99,999'", header = T, row.names = 1)
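
As a variation on the unite/spread approach, the sketch below spreads both summaries at once so that each group ends up on a single row, with column names following the [gender]_[state]_[income]_[summary] pattern described in the first answer. It is an illustration under assumptions rather than a recommended layout: it uses pivot_wider(), which supersedes spread() in tidyr 1.0.0 and later, its names_glue argument (tidyr 1.1.0 or later), and the .groups argument of summarize() (dplyr 1.0.0 or later). The name `wide` is made up for the example.

```r
library(dplyr)
library(tidyr)

# Same sample data as above
df <- read.table(text =
    "group      gender state age       income
 1       3      Female  CA     33  '$75,000 - $99,999'
 2       3        Male  MA     41  '$50,000 - $74,999'
 3       3        Male  KY     32  '$35,000 - $49,999'
 4       2      Female  CA     23  '$35,000 - $49,999'
 5       3        Male  KY     25  '$50,000 - $74,999'
 6       3        Male  MA     21  '$75,000 - $99,999'
 7       3      Female  CA     33  '$75,000 - $99,999'
 8       3        Male  MA     41  '$50,000 - $74,999'
 9       3        Male  KY     32  '$35,000 - $49,999'
10       2      Female  CA     23  '$35,000 - $49,999'
11       3        Male  KY     25  '$50,000 - $74,999'
12       3      Female  MA     21  '$75,000 - $99,999'", header = TRUE, row.names = 1)

wide <- df %>%
  group_by(group, gender, state, income) %>%
  summarize(count = n(), mean_age = mean(age), .groups = "drop") %>%
  # Combine the three grouping columns into a single key such as
  # "Female_CA_$35,000 - $49,999"
  unite(key, gender, state, income) %>%
  # Spread both summaries; the summary name is appended to each key
  pivot_wider(
    names_from  = key,
    values_from = c(count, mean_age),
    names_glue  = "{key}_{.value}",
    values_fill = list(count = 0)  # absent combinations get a count of 0
  )

as.data.frame(wide)
```

Because group is the only remaining identifier column, the result has one row per group. Counts for combinations that do not occur in a group are filled with 0, while the corresponding mean_age columns stay NA.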