Count non-zero observations by group

Time: 2019-06-25 04:30:47

Tags: r

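For the following data, I want to count the number of students in each class for each year, that is, the students whose value for that year is non-zero.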

   Class  Students Gender Height Year_1999  Year_2000 Year_2001 Year_2002
     1      Mark     M     180      80        54         22       12
     2      John     M     234      0         59         32       62
     1      Tom      M     124      0         53         26       12
     2      Jane     F     180      80        54         22       0
     3      Kim      F     140      0         2           3       32

The output should be

    Class  Year_1999   Year_2000   Year_2001  Year_2002
     1       1            2            2         2
     2       1            2            2         1
     3       0            1            1         1

I tried the following, without success

Number_obs = df %>% 
    group_by(class) %>% 
    summarise(count=n())

3 answers:

Answer 0: (score: 1)

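We can use summarise_at from dplyr. After grouping by 'Class', summarise_at loops over the columns whose names match "Year" and, for each one, takes the sum of the values that are not equal to 0. Here as.logical turns every non-zero value into TRUE, so the sum is the non-zero count.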

library(dplyr)
df1 %>% 
   group_by(Class) %>%
   summarise_at(vars(matches("Year")), list(~ sum(as.logical(.))))
# A tibble: 3 x 5
#  Class Year_1999 Year_2000 Year_2001 Year_2002
#  <int>     <int>     <int>     <int>     <int>
#1     1         1         2         2         2
#2     2         1         2         2         1
#3     3         0         1         1         1
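In dplyr 1.0.0 and later, summarise_at is superseded by across(). A minimal sketch of the same per-class count in that style, assuming a recent dplyr is installed. The data frame here is a trimmed copy of the example data holding only the relevant columns, and res_across is just an illustrative name:

```r
library(dplyr)

# Trimmed copy of the example data: only Class and the Year_* columns
df1 <- data.frame(
  Class     = c(1L, 2L, 1L, 2L, 3L),
  Year_1999 = c(80L, 0L, 0L, 80L, 0L),
  Year_2000 = c(54L, 59L, 53L, 54L, 2L),
  Year_2001 = c(22L, 32L, 26L, 22L, 3L),
  Year_2002 = c(12L, 62L, 12L, 0L, 32L)
)

# For every Year_* column, count the non-zero values within each Class
res_across <- df1 %>%
  group_by(Class) %>%
  summarise(across(starts_with("Year"), ~ sum(.x != 0)))

res_across
```

The comparison .x != 0 yields a logical vector, and sum() counts its TRUE entries, so the result matches the summarise_at version.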

Or we can gather into 'long' format, do the group_by and count on the single value column, and then spread back to 'wide' format

library(tidyr)
df1 %>% 
    gather(key, val, matches("Year")) %>%
    group_by(Class, key) %>%
    summarise(val = sum(val  != 0)) %>% 
    spread(key, val)
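In tidyr 1.0.0 and later, gather and spread are superseded by pivot_longer and pivot_wider. A sketch of the same long-then-wide approach with those functions, assuming recent tidyr and dplyr versions. The data frame is a trimmed copy of the example data, and res_long is an illustrative name:

```r
library(dplyr)
library(tidyr)

# Trimmed copy of the example data: only Class and the Year_* columns
df1 <- data.frame(
  Class     = c(1L, 2L, 1L, 2L, 3L),
  Year_1999 = c(80L, 0L, 0L, 80L, 0L),
  Year_2000 = c(54L, 59L, 53L, 54L, 2L),
  Year_2001 = c(22L, 32L, 26L, 22L, 3L),
  Year_2002 = c(12L, 62L, 12L, 0L, 32L)
)

res_long <- df1 %>%
  # One row per (student, year) pair
  pivot_longer(starts_with("Year"), names_to = "key", values_to = "val") %>%
  group_by(Class, key) %>%
  # Count the non-zero values per class and year
  summarise(val = sum(val != 0), .groups = "drop") %>%
  # Back to one column per year
  pivot_wider(names_from = key, values_from = val)

res_long
```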

Or using data.table

library(data.table)
setDT(df1)[, lapply(.SD, function(x) sum(as.logical(x))), .(Class), .SDcols = 5:8]

Or with aggregate from base R

aggregate(.~ Class, df1[-(2:4)], function(x) sum(x != 0))
#    Class Year_1999 Year_2000 Year_2001 Year_2002
#1     1         1         2         2         2
#2     2         1         2         2         1
#3     3         0         1         1         1

Or using rowsum

rowsum(+(!!df1[5:8]), df1$Class)
#    Year_1999 Year_2000 Year_2001 Year_2002
#1         1         2         2         2
#2         1         2         2         1
#3         0         1         1         1
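The rowsum one-liner is compact, so here is a sketch that unpacks it into separate steps in base R. The intermediate names (year_cols, is_nonzero, as_int, counts) are only illustrative, and the Year columns are selected by name pattern rather than by position:

```r
# Example data, same values as in the question
df1 <- data.frame(
  Class     = c(1L, 2L, 1L, 2L, 3L),
  Students  = c("Mark", "John", "Tom", "Jane", "Kim"),
  Gender    = c("M", "M", "M", "F", "F"),
  Height    = c(180L, 234L, 124L, 180L, 140L),
  Year_1999 = c(80L, 0L, 0L, 80L, 0L),
  Year_2000 = c(54L, 59L, 53L, 54L, 2L),
  Year_2001 = c(22L, 32L, 26L, 22L, 3L),
  Year_2002 = c(12L, 62L, 12L, 0L, 32L),
  stringsAsFactors = FALSE
)

# Step 1: pick the Year_* columns
year_cols <- df1[grep("^Year", names(df1))]

# Step 2: comparing a data frame with 0 gives a logical matrix
# (the double negation !! in the one-liner has the same effect)
is_nonzero <- year_cols != 0

# Step 3: unary plus coerces TRUE/FALSE to 1/0 and keeps the matrix shape
as_int <- +is_nonzero

# Step 4: rowsum adds up the rows that share the same Class
counts <- rowsum(as_int, df1$Class)
counts
```

Because rowsum returns a matrix with the group values as row names, Class appears as a row name rather than as a column.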

Or using colSums

t(sapply(split(as.data.frame(df1[5:8] != 0), df1$Class), colSums))

Data

df1 <- structure(list(Class = c(1L, 2L, 1L, 2L, 3L), Students = c("Mark", 
"John", "Tom", "Jane", "Kim"), Gender = c("M", "M", "M", "F", 
"F"), Height = c(180L, 234L, 124L, 180L, 140L), Year_1999 = c(80L, 
0L, 0L, 80L, 0L), Year_2000 = c(54L, 59L, 53L, 54L, 2L), Year_2001 = c(22L, 
32L, 26L, 22L, 3L), 
Year_2002 = c(12L, 62L, 12L, 0L, 32L)), class = "data.frame", 
  row.names = c(NA, 
-5L))

Answer 1: (score: 1)

Similar to @akrun's colSums solution, but using by

do.call(rbind, by(df[5:8] > 0, df[1], colSums))
#   Year_1999 Year_2000 Year_2001 Year_2002
# 1         1         2         2         2
# 2         1         2         2         1
# 3         0         1         1         1

Reduce(rbind, by(df[5:8] > 0, df[1], colSums))
#      Year_1999 Year_2000 Year_2001 Year_2002
# init         1         2         2         2
#              1         2         2         1
#              0         1         1         1

do.call is faster.
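The two outputs above also differ in their row names. A small sketch of why, as far as I can tell: do.call makes a single rbind call in which each list element is passed as a named argument, so the class labels become row names, whereas Reduce calls rbind pairwise and the labels are lost. The names per_class, one_call and pairwise are illustrative only:

```r
# Example data, same values as in the question
df <- data.frame(
  Class     = c(1L, 2L, 1L, 2L, 3L),
  Students  = c("Mark", "John", "Tom", "Jane", "Kim"),
  Gender    = c("M", "M", "M", "F", "F"),
  Height    = c(180L, 234L, 124L, 180L, 140L),
  Year_1999 = c(80L, 0L, 0L, 80L, 0L),
  Year_2000 = c(54L, 59L, 53L, 54L, 2L),
  Year_2001 = c(22L, 32L, 26L, 22L, 3L),
  Year_2002 = c(12L, 62L, 12L, 0L, 32L),
  stringsAsFactors = FALSE
)

# One vector of non-zero counts per class
per_class <- by(df[5:8] > 0, df[1], colSums)

# Single call: rbind(`1` = ..., `2` = ..., `3` = ...)
one_call <- do.call(rbind, per_class)

# Pairwise calls: rbind(rbind(first, second), third)
pairwise <- Reduce(rbind, per_class)

rownames(one_call)   # class labels are kept
rownames(pairwise)   # class labels are not kept
```

The counts themselves are identical in both results, so the choice mainly affects speed and labelling.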

Answer 2: (score: 0)

With dplyr, we can use summarise_at

library(dplyr)

df %>%
  group_by(Class) %>%
  summarise_at(vars(starts_with("Year")), ~sum(. != 0))

#  Class Year_1999 Year_2000 Year_2001 Year_2002
#  <int>     <int>     <int>     <int>     <int>
#1     1         1         2         2         2
#2     2         1         2         2         1
#3     3         0         1         1         1