Counting consecutive winning streaks by year

Time: 2019-04-19 13:49:59

Tags: r

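This is about athletes who took part in the Olympic Games. I am supposed to find the ten athletes with the longest medal-winning streaks.

For example: an athlete won in 2004, 2008 and 2012 -> so the athlete won 3 times in a row.

I am learning R and I am lost on this one.

I don't even know where to start with this problem.

I have cleaned my data as far as I could:
- only athletes who won a gold medal
- the actual years in which they won

My columns (after cleaning):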

id    name          team        year    medal
1     john doe      USA         2004    gold
1     john doe      USA         2008    gold
1     john doe      USA         2012    gold
2     marc twain    GER         2016    gold
3     edgar poe     FIN         2000    gold
3     edgar poe     FIN         2008    gold

I have already tried something like this:

mutate(won =
           if_else(condition = year == year +4,
                   true = "won",
                   false = "lost"))

or something like

mutate(won =
           if_else(
             condition = (year + 4) == tmp_year,
             true = "Following Year",
             false = if_else(
               condition = year == tmp_year,
               true = "Actual year",
               false = "No")))

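Here I only get "Actual year" and "No" as results.

In the end I want a table that shows how many times in a row each athlete won gold.

For example, the resulting dataset would look like this: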

id    name          won        
1     john doe      3
2     marc twain    1
3     edgar poe     1

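Edit: I am not looking for a complete answer, more for inspiration: which functions might be worth a look.

2 answers:

Answer 0: (score: 1)

Using dplyr, we can use diff for each name to calculate the difference between gold-medal-winning years, then group_by name and diff to count the consecutive wins.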

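library(dplyr)

df %>%
  group_by(name) %>%
  # Gap to the athlete's previous win; the first win gets 4 so it joins a 4-year streak
  mutate(diff = c(4, diff(year))) %>%
  # Caveat: this groups by the gap value, not by position, so two separate
  # 4-year streaks of one athlete would be counted together
  group_by(name, diff) %>%
  summarise(count = n()) %>%
  select(-diff)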


#    name      count
#   <fct>     <int>
#1 edgarpoe      1
#2 edgarpoe      1
#3 johndoe       3
#4 marctwain     1
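
Because the pipeline above groups by the gap value, it may help to label each streak explicitly instead. The following is only a sketch in base R, assuming wins exactly 4 years apart count as consecutive and that the rows can be sorted by year within each athlete; the column names streak and won are introduced here for illustration.

```r
# Toy data mirroring the cleaned table in the question
df <- data.frame(
  id   = c(1L, 1L, 1L, 2L, 3L, 3L),
  name = c("john doe", "john doe", "john doe", "marc twain", "edgar poe", "edgar poe"),
  year = c(2004L, 2008L, 2012L, 2016L, 2000L, 2008L)
)

# Sort so that diff() compares each win with the athlete's previous win
df <- df[order(df$id, df$year), ]

# Within each athlete, start a new streak whenever the gap is not 4 years
df$streak <- ave(df$year, df$id, FUN = function(y) cumsum(c(TRUE, diff(y) != 4)))

# Length of every streak, then the longest one per athlete
streak_len <- aggregate(year ~ id + name + streak, data = df, FUN = length)
longest    <- aggregate(year ~ id + name, data = streak_len, FUN = max)
names(longest)[names(longest) == "year"] <- "won"
longest <- longest[order(longest$id), ]
print(longest)
```

The cumsum(c(TRUE, ...)) idiom increments the streak label at every break, so a later 4-year gap after a longer pause starts a fresh streak and is not added to an earlier one.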

Answer 1: (score: 1)

Here is an option using cumsum and dplyr::lead, with the default equal to year + 4 (to account for the case where an athlete can have several medals)

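library(dplyr)
df %>% group_by(id) %>%
  # flag: years until the next win (last row defaults to 4); won: running count
  # of 4-year gaps, which does not reset after a longer gap
  mutate(flag = lead(year, default = last(year) + 4) - year, won = cumsum(flag == 4)) %>%
  select(-flag) %>%
  slice(which.max(won))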

# A tibble: 3 x 6
# Groups:   id [3]
       id name       team   year medal   won
    <int> <chr>      <chr> <int> <chr> <int>
  1     1 john doe   USA    2012 gold      3
  2     2 marc twain GER    2016 gold      1
  3     3 edgar poe  FIN    2008 gold      1
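
One thing to keep in mind with a running count of 4-year gaps is that it keeps growing across a longer pause. A possible alternative, sketched here with base R's rle, measures the longest uninterrupted run instead. The helper name longest_streak and the example years are made up for this illustration.

```r
# Hypothetical helper: longest run of wins spaced exactly `gap` years apart
longest_streak <- function(years, gap = 4) {
  years <- sort(unique(years))
  if (length(years) < 2) return(length(years))
  runs <- rle(diff(years) == gap)          # runs of "gap is exactly 4" / "gap is not 4"
  hits <- runs$lengths[runs$values]        # lengths of the runs of 4-year gaps
  if (length(hits) == 0) 1L else max(hits) + 1L   # k consecutive gaps = k + 1 wins
}

years <- c(2000, 2004, 2012, 2016)   # made-up example: two separate streaks of 2 wins

# Counting every 4-year gap plus the final row, as cumsum(flag == 4) does
gaps_counted <- sum(diff(years) == 4) + 1

print(gaps_counted)            # counts gaps from both streaks together
print(longest_streak(years))   # longest uninterrupted streak only
```

Inside a dplyr pipeline, such a helper could be called once per athlete, for example in summarise(won = longest_streak(year)) after group_by(id).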

Update by @akrun

This can be done in a compact way

# rleid() comes from data.table; group_by(.add = TRUE) was add = TRUE before dplyr 1.0.0
df %>% group_by(id, name, team) %>%
  mutate(yearlead = lead(year, default = year[n()] + 4), yeardiff = yearlead - year) %>%
  group_by(grp = data.table::rleid(case_when(yeardiff == 4 ~ as.integer(yeardiff), TRUE ~ row_number())), .add = TRUE) %>%
  summarise(n = n())

# A tibble: 4 x 5
# Groups:   id, name, team [?]
  id name       team    grp     n
  <int> <chr>      <chr> <int> <int>
  1     1 john doe   USA       1     3
  2     2 marc twain GER       1     1
  3     3 edgar poe  FIN       1     1
  4     3 edgar poe  FIN       2     1
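
The output above still has one row per streak, while the question asks for the ten athletes with the longest streaks. A minimal sketch of that last step follows, assuming dplyr 1.0.0 or later (for the .groups argument) and a data frame shaped like the result above. The object names streaks and top10 are introduced here for illustration.

```r
library(dplyr)

# Streak lengths shaped like the summarise() output above (one row per streak)
streaks <- data.frame(
  id   = c(1L, 2L, 3L, 3L),
  name = c("john doe", "marc twain", "edgar poe", "edgar poe"),
  team = c("USA", "GER", "FIN", "FIN"),
  grp  = c(1L, 1L, 1L, 2L),
  n    = c(3L, 1L, 1L, 1L)
)

top10 <- streaks %>%
  group_by(id, name, team) %>%
  summarise(won = max(n), .groups = "drop") %>%  # longest streak per athlete
  arrange(desc(won)) %>%                         # best streaks first
  head(10)                                       # keep at most ten athletes

print(top10)
```

Ties at the tenth place are cut arbitrarily by head(); whether to keep them is a choice the question leaves open.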

Data (this data differs from the OP's dataset)

df <- structure(list(id = c(1L, 1L, 1L, 2L, 3L, 3L, 3L, 3L, 3L), name = c("john doe", "john doe", "john doe", "marc twain", "edgar poe", "edgar poe", "edgar poe", "edgar poe", "edgar poe"), 
       team = c("USA", "USA", "USA", "GER", "FIN", "FIN", "FIN", "FIN", "FIN"), year = c(2004L, 2008L, 2012L, 2016L, 2000L, 2008L, 2016L, 2020L, 2024L), medal = c("gold", "gold", "gold", "gold", "gold", "gold", "gold", "gold", "gold" )), class = "data.frame", row.names = c(NA, -9L))