R由组织组织的相对频率

时间:2018-05-15 18:56:57

标签: r group-by dplyr frequency summarize

我有以下数据框:

> LikelyRenew_ReasonB %>%
    +   mutate(cum_Sum = ave(freq,Name,FUN = cumsum))
          Name x freq cum_Sum
 1       costC 1   10      10
 2       costC 2   11      21
 3       costC 3   17      38
 4       costC 4  149     187
 5   productsC 1   31      31
 6   productsC 2   40      71
 7   productsC 3   30     101
 8   productsC 4   86     187
 9     communC 1   51      51
 10    communC 2   50     101
 11    communC 3   34     135
 12    communC 4   52     187
 13 reimburseC 1   42      42
 14 reimburseC 2   26      68
 15 reimburseC 3   25      93
 16 reimburseC 4   94     187
 17    policyC 1   31      31
 18    policyC 2   25      56
 19    policyC 3   28      84
 20    policyC 4  103     187
 21  discountC 1    2       2
 22  discountC 2    2       4
 23  discountC 3    3       7
 24  discountC 4  180     187

这是变量的样子:

> dput(head(LikelyRenew_ReasonB))
structure(list(Name = c("costC", "costC", "costC", "costC", "productsC", 
"productsC"), x = c(1, 2, 3, 4, 1, 2), freq = c(10L, 11L, 17L, 
149L, 31L, 40L)), .Names = c("Name", "x", "freq"), row.names = c(NA, 
6L), class = "data.frame")

我试图获得每个组,每个频率分数的相对频率,然后是该组的相对频率之和。我在下面放了一个我正在寻找的样本 - 前三行是他们的freq / cum_Sum [x == 4]。最后一行应该是这3行的总和。

这可能吗?我完全难过了。

          Name x freq cum_Sum    IdealOutput   *how i calculated IdealOutput
 1       costC 1   10      10         5.35       (10/187)
 2       costC 2   11      21         5.88       (11/187)
 3       costC 3   17      38         9.09       (17/187) 
 4       costC 4  149     187         20.32      (sum of above 3 values)

1 个答案:

答案 0 :(得分:2)

您可以尝试使用dplyr::lag上的cum_Sum来计算群组最后一行的IdealOutput

可以使用条件row_number() == n()

找到组的最后一行
library(dplyr)

LikelyRenew_ReasonB %>% group_by(Name) %>%
  arrange(Name, x) %>%
  mutate(cum_Sum = cumsum(freq)) %>%
  mutate(IdealOutput = ifelse(row_number() == n(), 
                       lag(cum_Sum)/sum(freq), freq/sum(freq))) 

#   # A tibble: 6 x 5
#   # Groups: Name [2]
#   Name          x  freq cum_Sum IdealOutput
# <chr>     <dbl> <int>   <int>       <dbl>
# 1 costC      1.00    10      10      0.0535
# 2 costC      2.00    11      21      0.0588
# 3 costC      3.00    17      38      0.0909
# 4 costC      4.00   149     187      0.203 
# 5 productsC  1.00    31      31      0.437 
# 6 productsC  2.00    40      71      0.437 

数据:

LikelyRenew_ReasonB  <- structure(list(Name = c("costC", "costC", "costC", "costC", "productsC", 
"productsC"), x = c(1, 2, 3, 4, 1, 2), freq = c(10L, 11L, 17L, 
149L, 31L, 40L)), .Names = c("Name", "x", "freq"), row.names = c(NA, 
6L), class = "data.frame")