Percentage of a team with a certain number of entries

Time: 2018-11-26 17:48:28

Tags: r

I have a dataset that looks like this:

> df
    teams people entries
1  A Team   6fd1      49
2  A Team   1df5       4
3  A Team   2hgt      19
4  A Team   8akt       4
5  A Team   sdf9      19
6  B Team   asc1      42
7  B Team   abm8      32
8  B Team   plo9      38
9  B Team   90la       5
10 B Team   8uil      23

> dput(df)
structure(list(teams = c("A Team", "A Team", "A Team", "A Team", 
"A Team", "B Team", "B Team", "B Team", "B Team", "B Team"), 
    people = c("6fd1", "1df5", "2hgt", "8akt", "sdf9", "asc1", 
    "abm8", "plo9", "90la", "8uil"), entries = c(49, 4, 19, 4, 
    19, 42, 32, 38, 5, 23)), .Names = c("teams", "people", "entries"
), row.names = c(NA, -10L), class = "data.frame")

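I can get the portion of each team that accounts for at least 75% of the team's entries by doing the following, although it is messy and probably not the best approach: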

#  sorted df and added cumulative percentage/sum and row number per team

> df
    teams people entries cumulative_sum cumulative_perc number
1  A Team   6fd1      49             49        51.57895      1
3  A Team   2hgt      19             68        71.57895      2
5  A Team   sdf9      19             87        91.57895      3
2  A Team   1df5       4             91        95.78947      4
4  A Team   8akt       4             95       100.00000      5
7  B Team   abm8      89             89        45.17766      1
6  B Team   asc1      42            131        66.49746      2
8  B Team   plo9      38            169        85.78680      3
10 B Team   8uil      23            192        97.46193      4
9  B Team   90la       5            197       100.00000      5

#  from this view, each team has 3/5 people (60%) reaching the minimum 75% 
#  entries, and using ddply, we can get that

library(plyr)  # ddply() and summarise() come from plyr

ddply(df, 'teams', summarise,
      marker = min(which(cumulative_perc > 75)),
      total = NROW(teams),
      seventyfive = marker/total)

   teams marker total seventyfive
1 A Team      3     5       0.6
2 B Team      3     5       0.6
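The question does not show how the sorted table and its cumulative columns were built. The following is a minimal base R sketch of one way to do it, starting from the dput() data; it is not the asker's code, and the name sorted is an assumption introduced here. Because it uses the dput() values, the B Team rows follow those values.

```r
# Data as given by dput(df) in the question
df <- data.frame(
  teams   = rep(c("A Team", "B Team"), each = 5),
  people  = c("6fd1", "1df5", "2hgt", "8akt", "sdf9",
              "asc1", "abm8", "plo9", "90la", "8uil"),
  entries = c(49, 4, 19, 4, 19, 42, 32, 38, 5, 23),
  stringsAsFactors = FALSE
)

# Sort by team, then by entries from largest to smallest
# ("sorted" is a hypothetical name, not from the question)
sorted <- df[order(df$teams, -df$entries), ]

# Running total of entries within each team
sorted$cumulative_sum <- ave(sorted$entries, sorted$teams, FUN = cumsum)

# Running total as a percentage of the team's total
team_total <- ave(sorted$entries, sorted$teams, FUN = sum)
sorted$cumulative_perc <- 100 * sorted$cumulative_sum / team_total

# Position of each person within the team after sorting
sorted$number <- ave(sorted$entries, sorted$teams, FUN = seq_along)

print(sorted)
```

For A Team this gives cumulative sums of 49, 68, 87, 91 and 95, so the third person is the first to push the team past 75%.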

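Although this works, I would like to count only the part of the third person's entries that is actually needed to reach 75% of the team's entries. For example, for A Team, 75% of its entries is 72 (rounded up), which means we only need 4 of the third person's 19 entries (68 + 4 = 72), giving that team 2.21/5 rather than 3/5.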

2 answers:

Answer 0: (score: 1)

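library(dplyr)

df %>% group_by(teams) %>% summarise(seventyfive = {
  tmp1 <- ceiling(0.75 * sum(entries))
  tmp2 <- sum(cumsum(entries) < tmp1)
  tmp2 + (tmp1 - sum(entries[1:tmp2])) / entries[tmp2 + 1]
})
# A tibble: 2 x 2
#   teams  seventyfive
#   <chr>        <dbl>
# 1 A Team        2.21
# 2 B Team        2.78

Here tmp1 is 75% of the team's entries (rounded up), and tmp2 is the largest number of people whose cumulative entries still fall below that threshold. The last line then computes the required quantity directly: the tmp2 people counted in full, plus the fraction of the next person's entries needed to reach tmp1. This assumes df is already sorted by entries in descending order within each team, as in the question's second table.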
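The same calculation can be sketched as a small base R function. The name people_needed and the share argument are hypothetical and introduced here only for illustration. The sketch sorts inside the function, and it assumes positive integer entry counts and a share between 0 and 1. It also handles the case where the top person alone already reaches the target, where indexing with 1:tmp2 would pick up the first element instead of nothing.

```r
# Hypothetical helper: fractional number of people needed to reach
# `share` of a team's entries (assumes positive integer counts, 0 < share <= 1)
people_needed <- function(entries, share = 0.75) {
  entries <- sort(entries, decreasing = TRUE)   # largest contributors first
  target  <- ceiling(share * sum(entries))      # entries needed, rounded up
  running <- cumsum(entries)
  full    <- sum(running < target)              # people counted in full
  covered <- if (full > 0) running[full] else 0 # entries they contribute
  # add the fraction of the next person's entries still required
  full + (target - covered) / entries[full + 1]
}

# Data as given by dput(df) in the question
df <- data.frame(
  teams   = rep(c("A Team", "B Team"), each = 5),
  entries = c(49, 4, 19, 4, 19, 42, 32, 38, 5, 23),
  stringsAsFactors = FALSE
)

res <- sapply(split(df$entries, df$teams), people_needed)
print(round(res, 2))
# A Team B Team
#   2.21   2.78
```

With a team such as c(90, 5, 5), the target is 75 and the first person alone covers it, so the function returns 75/90, roughly 0.83 of a person.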

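Answer 1: (score: 1)

lead() gives you the value of a variable in the next row of the current group.

The approach below keeps the single row per team where delta, the number of entries still needed to reach the 75% minimum divided by the next person's entries, is a fraction between 0 and 1.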

library(dplyr)

df %>%
    group_by(teams) %>%
    arrange(teams, -entries) %>%
    mutate(delta = (ceiling(0.75 * sum(entries)) - cumsum(entries)) / lead(entries),
           marker = row_number() + delta) %>%
    filter(delta >= 0 & delta <= 1) %>%
    select(teams, marker)

# A tibble: 2 x 2
# Groups:   teams [2]
  teams  marker
  <chr>   <dbl>
1 A Team   2.21
2 B Team   2.78
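The question asks for the result as a share of the team (2.21/5 rather than 3/5). As a sketch under that reading, the pipeline above can be extended with n() to divide marker by the team size. The share column and the res name are assumptions added here, not part of the original answer.

```r
library(dplyr)

# Data as given by dput(df) in the question
df <- data.frame(
  teams   = rep(c("A Team", "B Team"), each = 5),
  people  = c("6fd1", "1df5", "2hgt", "8akt", "sdf9",
              "asc1", "abm8", "plo9", "90la", "8uil"),
  entries = c(49, 4, 19, 4, 19, 42, 32, 38, 5, 23),
  stringsAsFactors = FALSE
)

res <- df %>%
  group_by(teams) %>%
  arrange(teams, desc(entries)) %>%
  mutate(
    # entries still missing after this row, as a fraction of the next row's entries
    delta  = (ceiling(0.75 * sum(entries)) - cumsum(entries)) / lead(entries),
    marker = row_number() + delta,
    share  = marker / n()          # hypothetical extra column: marker / team size
  ) %>%
  filter(delta >= 0 & delta <= 1) %>%
  ungroup() %>%
  select(teams, marker, share)

print(res)
# share is about 0.442 for A Team and 0.556 for B Team
```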