Calculating the percentage occurrence of a variable across multiple groups

Time: 2018-03-25 13:02:34

Tags: r dplyr data.table tidyverse purrr

Sample data

set.seed(123)
df <- data.frame(loc.id = rep(1:1000, each = 35),
                 year = rep(1980:2014, times = 1000),
                 month.id = sample(c(1:4, 8:10, 12), 35 * 1000, replace = TRUE))

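The data frame holds 1000 locations × 35 years of data for a variable named month.id, which is essentially the month of the year. For each year, I want to calculate the percentage occurrence of each month. For example, for 1980: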

month.vec <- df[df$year == 1980,]
table(month.vec$month.id)
1   2   3   4   8   9  10  12 
106 132 116 122 114 130 141 139 

To calculate the percentage occurrence of each month:

table(month.vec$month.id)/length(month.vec$month.id) * 100
1    2    3    4    8    9   10   12 
10.6 13.2 11.6 12.2 11.4 13.0 14.1 13.9 
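
As an aside, base R's prop.table() expresses the same calculation without dividing by the length by hand. The sketch below recreates the sample data so that it runs on its own; the variable names pct.manual and pct.prop are illustrative only. The exact counts may differ from those shown above depending on the R version, because the default algorithm of sample() changed in R 3.6.0.

```r
set.seed(123)
df <- data.frame(loc.id   = rep(1:1000, each = 35),
                 year     = rep(1980:2014, times = 1000),
                 month.id = sample(c(1:4, 8:10, 12), 35 * 1000, replace = TRUE))

month.vec <- df[df$year == 1980, ]

# Manual version from the question: counts divided by the number of rows
pct.manual <- table(month.vec$month.id) / length(month.vec$month.id) * 100

# Same calculation with prop.table(), which converts counts to proportions
pct.prop <- prop.table(table(month.vec$month.id)) * 100

# Compare the two results
all.equal(as.vector(pct.manual), as.vector(pct.prop))
```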

I would like to end up with a table like this:

    year month percent
    1980   1    10.6
    1980   2    13.2
    1980   3    11.6
    1980   4    12.2
    1980   5    NA
    1980   6    NA
    1980   7    NA
    1980   8    11.4
    1980   9    13.0
    1980   10   14.1
    1980   11   NA
    1980   12   13.9

Since months 5, 6, 7 and 11 are missing, I just want to add extra rows with NA for those months. If possible, I would like a dplyr solution along these lines:

   library(dplyr)
   df %>% group_by(year) %>% summarise(percentage.contri = table(month.id)/length(month.id)*100)  

2 answers:

Answer 0: (score: 4)

A solution using dplyr and tidyr:

# To get month as integer use (or add as.integer to mutate):
# df$month.id <- as.integer(df$month.id)

library(dplyr)
library(tidyr)
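df %>%
    group_by(year, month.id) %>%
    # Count occurrences per year & month
    summarise(n = n()) %>%
    # Get percent per month (the result is still grouped by year, so sum(n) is the yearly total)
    mutate(percent = n / sum(n) * 100) %>%
    # Drop the grouping: recent tidyr versions do not allow completing a grouping column
    ungroup() %>%
    # Fill in missing months (drop the fill argument to get NA instead of 0)
    complete(year, month.id = 1:12, fill = list(percent = 0)) %>%
    select(year, month.id, percent)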
    year month.id percent
   <int>    <dbl>   <dbl>
 1  1980     1.00    10.6
 2  1980     2.00    13.2
 3  1980     3.00    11.6
 4  1980     4.00    12.2
 5  1980     5.00     0  
 6  1980     6.00     0  
 7  1980     7.00     0  
 8  1980     8.00    11.4
 9  1980     9.00    13.0
10  1980    10.0     14.1
11  1980    11.0      0  
12  1980    12.0     13.9
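
The question asked for NA rather than 0 in the missing months. A minimal sketch of that variant follows, assuming the same sample data (recreated here so the snippet is self-contained) and using count() as a shorthand for the group_by() and summarise() steps; the name res is illustrative. Leaving out the fill argument lets complete() insert NA for the new rows.

```r
library(dplyr)
library(tidyr)

set.seed(123)
df <- data.frame(loc.id   = rep(1:1000, each = 35),
                 year     = rep(1980:2014, times = 1000),
                 month.id = sample(c(1:4, 8:10, 12), 35 * 1000, replace = TRUE))

res <- df %>%
    # Occurrences per year and month (column n)
    count(year, month.id) %>%
    # Percent within each year
    group_by(year) %>%
    mutate(percent = n / sum(n) * 100) %>%
    # Ungroup before completing, so year is no longer a grouping column
    ungroup() %>%
    # Add the absent months; with no fill argument they get NA
    complete(year, month.id = 1:12) %>%
    select(year, month.id, percent)

head(res, 12)
```

This yields one row per year and month, with NA in months 5, 6, 7 and 11.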

Answer 1: (score: 3)

A base R solution:

tab <- table(month.vec$year, factor(month.vec$month.id, levels = 1:12))/length(month.vec$month.id) * 100
dfnew <- as.data.frame(tab)

which gives:

> dfnew
   Var1 Var2 Freq
1  1980    1 10.6
2  1980    2 13.2
3  1980    3 11.6
4  1980    4 12.2
5  1980    5  0.0
6  1980    6  0.0
7  1980    7  0.0
8  1980    8 11.4
9  1980    9 13.0
10 1980   10 14.1
11 1980   11  0.0
12 1980   12 13.9
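
The snippet above handles a single year. One possible way to extend it to every year at once is a year-by-month table with row-wise proportions, sketched below under the assumption that the full sample data is available (it is recreated here). The names counts, pct and out, and the output column names, are illustrative. Zeros are converted to NA to match what the question asked for.

```r
set.seed(123)
df <- data.frame(loc.id   = rep(1:1000, each = 35),
                 year     = rep(1980:2014, times = 1000),
                 month.id = sample(c(1:4, 8:10, 12), 35 * 1000, replace = TRUE))

# Year x month contingency table; factor() forces all 12 month levels to appear
counts <- table(df$year, factor(df$month.id, levels = 1:12))

# margin = 1 normalises each row (year) separately
pct <- prop.table(counts, margin = 1) * 100

# Long format: one row per year/month combination
out <- as.data.frame(pct, stringsAsFactors = FALSE)
names(out) <- c("year", "month", "percent")
out$year  <- as.integer(out$year)
out$month <- as.integer(out$month)

# Months that never occur come out as 0; replace them with NA
out$percent[out$percent == 0] <- NA

out <- out[order(out$year, out$month), ]
head(out, 12)
```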

Using data.table:

library(data.table)

setDT(month.vec)[, .N, by = .(year, month.id)
                 ][.(year = 1980, month.id = 1:12), on = .(year, month.id)
                   ][, N := 100 * N/sum(N, na.rm = TRUE)][]
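
The same join idea can be stretched over all years by joining against every year/month combination and computing the percentage by year. This is a sketch, not part of the original answer: it assumes the sample data from the question, rebuilt here as a data.table with month.id converted to integer so that the join column types match, and the names dt, grid and res are illustrative.

```r
library(data.table)

set.seed(123)
dt <- data.table(loc.id   = rep(1:1000, each = 35),
                 year     = rep(1980:2014, times = 1000),
                 month.id = as.integer(sample(c(1:4, 8:10, 12), 35 * 1000, replace = TRUE)))

# Every year/month combination, used as the lookup side of the join
grid <- CJ(year = 1980:2014, month.id = 1:12)

res <- dt[, .N, by = .(year, month.id)            # counts per year and month
          ][grid, on = .(year, month.id)           # absent months get N = NA
            ][, percent := 100 * N / sum(N, na.rm = TRUE), by = year
              ][, .(year, month.id, percent)]

head(res, 12)
```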