Calculate proportions within count categories, conditional on other categorical variables

Time: 2016-11-21 13:02:54

Tags: r dplyr data-manipulation

I hope this convoluted title makes sense; the problem I ran into is not easy to sum up in a headline.

A toy dataset lists customer visits along with each customer's exemption status and the type of visit:

df <- structure(list(Customer = structure(c(8L, 2L, 5L, 4L, 4L, 1L, 
1L, 6L, 6L, 7L, 7L, 7L, 3L, 3L, 3L), .Label = c("Aaron", "Elizabeth", 
"Frank", "John", "Mary", "Pam", "Rob", "Sam"), class = "factor"), 
    Exemption = structure(c(2L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 
    2L, 2L, 2L, 1L, 1L, 1L), .Label = c("Exempt", "Non-exempt"
    ), class = "factor"), Type = structure(c(1L, 1L, 2L, 1L, 
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 2L), .Label = c("Type 1", 
    "Type 2"), class = "factor")), .Names = c("Customer", "Exemption", 
"Type"), class = "data.frame", row.names = c(NA, -15L))

    Customer  Exemption   Type
1        Sam Non-exempt Type 1
2  Elizabeth     Exempt Type 1
3       Mary     Exempt Type 2
4       John Non-exempt Type 1
5       John Non-exempt Type 2
6      Aaron Non-exempt Type 2
7      Aaron Non-exempt Type 2
8        Pam     Exempt Type 2
9        Pam     Exempt Type 2
10       Rob Non-exempt Type 2
11       Rob Non-exempt Type 2
12       Rob Non-exempt Type 1
13     Frank     Exempt Type 1
14     Frank     Exempt Type 1
15     Frank     Exempt Type 2

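I would like to categorize customers by their number of visits and then, within each category, calculate the proportion of Type 1 / Type 2 visits, with the results also broken down by exemption status. For example, the output would look like this: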

   Number_of_visits  Exemption   Type Proportion
1                 1 Non-exempt Type 1       1.00
2                 1 Non-exempt Type 2       0.00
3                 1     Exempt Type 1       0.50
4                 1     Exempt Type 2       0.50
5                 2 Non-exempt Type 1       0.25
6                 2 Non-exempt Type 2       0.75
7                 2     Exempt Type 1       0.00
8                 2     Exempt Type 2       1.00
9                 3 Non-exempt Type 1       0.33
10                3 Non-exempt Type 2       0.67
11                3     Exempt Type 1       0.67
12                3     Exempt Type 2       0.33
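
To make the target concrete, one pair of rows can be reproduced by hand. The sketch below uses base R only and rebuilds the toy data in a compact form; the names `check`, `two` and `props` are introduced here purely for illustration. It checks the Non-exempt rows for customers with exactly two visits (John and Aaron).

```r
# Rebuild the toy data compactly (same 15 rows as the dput() output above).
df <- data.frame(
  Customer  = c("Sam", "Elizabeth", "Mary", "John", "John", "Aaron", "Aaron",
                "Pam", "Pam", "Rob", "Rob", "Rob", "Frank", "Frank", "Frank"),
  Exemption = c("Non-exempt", "Exempt", "Exempt", "Non-exempt", "Non-exempt",
                "Non-exempt", "Non-exempt", "Exempt", "Exempt", "Non-exempt",
                "Non-exempt", "Non-exempt", "Exempt", "Exempt", "Exempt"),
  Type      = c("Type 1", "Type 1", "Type 2", "Type 1", "Type 2", "Type 2",
                "Type 2", "Type 2", "Type 2", "Type 2", "Type 2", "Type 1",
                "Type 1", "Type 1", "Type 2"),
  stringsAsFactors = TRUE
)

# Work on a copy so df itself stays unchanged.
check <- df

# Visits per customer, attached to every row: ave() applies length()
# within each Customer group.
check$Number_of_visits <- ave(seq_len(nrow(check)), check$Customer, FUN = length)

# Keep only visits by non-exempt customers who came exactly twice.
two <- check[check$Number_of_visits == 2 & check$Exemption == "Non-exempt", ]

# Share of each visit type among those visits.
props <- prop.table(table(two$Type))
print(props)
```

John and Aaron account for four visits in total, one of Type 1 and three of Type 2, so this gives 0.25 and 0.75, the values in rows 5 and 6 of the table.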

I tried a few things with dplyr, such as group_by(Customer, Type) %>% summarise(n()), but that does not seem right.

1 answer:

Answer 0: (score: 1)

You can use count from dplyr to count the occurrences of each combination of Exemption and Type, grouped by Number_of_visits:

library(dplyr)
library(tidyr)
res <- df %>% group_by(Customer) %>% 
              mutate(Number_of_visits=n()) %>% 
              group_by(Number_of_visits) %>% 
              count(Exemption, Type) %>%
              complete(Type, fill=list(n=0)) %>%
              group_by(Number_of_visits,Exemption) %>% 
              mutate(Proportion=n/sum(n))

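Notes:

  1. First, group_by Customer and use n() to count the number of visits per customer.
  2. Then group_by Number_of_visits and use count to count the occurrences of each pair of Exemption and Type values. This creates a column named n that holds the count.
  3. Use tidyr::complete to fill in any missing pairs of Exemption and Type values with a count of zero.
  4. Finally, group_by Number_of_visits and Exemption to compute the required Proportion.
  5. The result with your data is as expected.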

    print(res)
    ##Source: local data frame [12 x 5]
    ##Groups: Number_of_visits, Exemption [6]
    ##
    ##   Number_of_visits  Exemption   Type     n Proportion
    ##              <int>     <fctr> <fctr> <dbl>      <dbl>
    ##1                 1     Exempt Type 1     1  0.5000000
    ##2                 1     Exempt Type 2     1  0.5000000
    ##3                 1 Non-exempt Type 1     1  1.0000000
    ##4                 1 Non-exempt Type 2     0  0.0000000
    ##5                 2     Exempt Type 1     0  0.0000000
    ##6                 2     Exempt Type 2     2  1.0000000
    ##7                 2 Non-exempt Type 1     1  0.2500000
    ##8                 2 Non-exempt Type 2     3  0.7500000
    ##9                 3     Exempt Type 1     2  0.6666667
    ##10                3     Exempt Type 2     1  0.3333333
    ##11                3 Non-exempt Type 1     1  0.3333333
    ##12                3 Non-exempt Type 2     2  0.6666667
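
The grouping that count() leaves behind has, to my knowledge, changed across dplyr versions, and the complete(Type, ...) step above relies on the result still being grouped by Exemption. The following is a minimal sketch of the same idea that avoids that dependency. It assumes a reasonably recent dplyr and tidyr (for add_count with a name argument and for nesting); `res2` is a name introduced here, and the data frame is rebuilt compactly so the block runs on its own.

```r
library(dplyr)
library(tidyr)

# Rebuild the toy data compactly (same 15 rows as in the question).
df <- data.frame(
  Customer  = c("Sam", "Elizabeth", "Mary", "John", "John", "Aaron", "Aaron",
                "Pam", "Pam", "Rob", "Rob", "Rob", "Frank", "Frank", "Frank"),
  Exemption = c("Non-exempt", "Exempt", "Exempt", "Non-exempt", "Non-exempt",
                "Non-exempt", "Non-exempt", "Exempt", "Exempt", "Non-exempt",
                "Non-exempt", "Non-exempt", "Exempt", "Exempt", "Exempt"),
  Type      = c("Type 1", "Type 1", "Type 2", "Type 1", "Type 2", "Type 2",
                "Type 2", "Type 2", "Type 2", "Type 2", "Type 2", "Type 1",
                "Type 1", "Type 1", "Type 2"),
  stringsAsFactors = TRUE
)

res2 <- df %>%
  # Attach each customer's total number of visits to every row.
  add_count(Customer, name = "Number_of_visits") %>%
  # Count visits per combination on an ungrouped data frame.
  count(Number_of_visits, Exemption, Type) %>%
  # Add zero-count rows for missing Type values, but only within
  # Number_of_visits / Exemption combinations that occur in the data.
  complete(nesting(Number_of_visits, Exemption), Type, fill = list(n = 0L)) %>%
  # Proportion of each Type within a visit-count / exemption group.
  group_by(Number_of_visits, Exemption) %>%
  mutate(Proportion = n / sum(n)) %>%
  ungroup() %>%
  arrange(Number_of_visits, Exemption, Type)

print(res2)
```

Using nesting() keeps complete() from inventing Number_of_visits / Exemption combinations with no customers at all, which would otherwise produce 0/0 proportions.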