Question

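I want to collapse a data frame based on various arguments from multiple variables, but I'm not sure how to do it in the simplest way. I think it will require some kind of custom function, but I have little experience writing functions.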

Basically, my data frame currently looks like this:

chainID     teamID        statID        startType       endType        

1           Team A     Effective Pass      TO              TO
1           Team A     Effective Pass      TO              TO
1           Team A     Effective Pass      TO              TO
1           Team A     Effective Pass      TO              TO
1           Team A     Ineffective Pass    TO              TO
2           Team B     Effective Pass      TO              SH
2           Team B     Entry               TO              SH
2           Team B     Effective Pass      TO              SH
2           Team B     Shot                TO              SH
3           Team A     Effective Pass      ST              TO
3           Team A     Entry               ST              TO
3           Team A     Ineffective Pass    ST              TO
4           Team B     Effective Pass      TO              ST
4           Team B     Effective Pass      TO              ST
4           Team B     Ineffective Pass    TO              ST
5           Team A     Effective Pass      TO              SH
5           Team A     Entry               TO              SH
5           Team A     Goal                TO              SH
6           Team B     Effective Pass      CB              TO
6           Team B     Effective Pass      CB              TO
6           Team B     Ineffective Pass    CB              TO
7           Team A     Effective Pass      TO              ST
7           Team A     Ineffective Pass    TO              ST

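What I want to do is this: whenever the word Entry appears in the statID column, I want to keep that row and the last row of that chainID, while deleting all other rows for that particular chainID (see chainID 2 and 5). In addition, if a chainID contains Entry in statID but the last row of that chainID does not end in a goal or shot, I want the next chainID to remain in the data set as well, as with chainID 3 and 4 in my example. The function then carries on as at the start, looking for occurrences of Entry in each chainID. For example: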

chainID

Answer 1

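The answer is split into two functions. The first, select_rows, selects rows from each group based on the presence of "Entry". The second, select_groups, identifies the groups that contain an "Entry" but do not end in "Goal" or "Shot".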

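library(dplyr)

# anyEntry holds the row index of the first "Entry" in the group,
# or 0 if the group has none (same value on every row of the group).
# Returns the row indices to keep for one group.
select_rows <- function(anyEntry, statID) {
  # Proceed only if the group contains an "Entry" (anyEntry is not 0)
  if (anyEntry[1L]) {
    if (last(statID) %in% c("Goal", "Shot")) {
      # Chain ends in a goal or shot: keep the "Entry" row and the last row
      c(anyEntry[1L], length(anyEntry))
    } else {
      # Otherwise keep every row from "Entry" through the last row
      anyEntry[1L]:length(anyEntry)
    }
  } else {
    0  # no "Entry": slice(0) keeps no rows from this group
  }
}

# TRUE for a group that contains an "Entry" but does not end in "Goal" or "Shot"
select_groups <- function(anyEntry, statID) {
  anyEntry[1L] & !last(statID) %in% c("Goal", "Shot")
}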

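We create the anyEntry column, which holds the row number of the first "Entry" within each group, or 0 if the group has none. We then apply the select_rows and select_groups functions separately and bind the resulting rows together.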
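The multiplication by any() in the anyEntry expression matters because which.max() returns 1 for a logical vector with no TRUE in it, which would otherwise look like an "Entry" on the first row. The sketch below isolates that expression in base R; the helper name first_entry_index and the two sample vectors are made up for illustration and are not part of the answer.

```r
# Hypothetical helper wrapping the expression used for the anyEntry column.
first_entry_index <- function(statID) {
  # which.max() on a logical vector gives the position of the first TRUE,
  # but it also gives 1 when there is no TRUE at all. Multiplying by any()
  # turns that misleading 1 into 0.
  which.max(statID == "Entry") * any(statID == "Entry")
}

# Illustrative groups: one with an "Entry", one without.
with_entry    <- c("Effective Pass", "Entry", "Effective Pass", "Shot")
without_entry <- c("Effective Pass", "Ineffective Pass")

first_entry_index(with_entry)     # 2
first_entry_index(without_entry)  # 0
```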

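# Per chain: index of the first "Entry" row, or 0 when the chain has none
df1 <- df %>%
  group_by(chainID) %>%
  mutate(anyEntry = which.max(statID == "Entry") * any(statID == "Entry"))

# chainIDs that contain an "Entry" but do not end in "Goal" or "Shot"
Ids <- df1 %>%
  summarise(newEntry = select_groups(anyEntry, statID)) %>%
  filter(newEntry) %>%
  pull(chainID)

# Keep the selected rows of each chain, then add every row of the chain that
# follows one in Ids (Ids + 1 assumes consecutive numeric chainIDs)
df1 %>%
  slice(select_rows(anyEntry, statID)) %>%
  bind_rows(df %>% filter(chainID %in% (Ids + 1))) %>%
  select(-anyEntry) %>%
  arrange(chainID)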

#   chainID teamID statID    startType  endType
#     <int> <fct>  <fct>        <fct>     <fct>  
#1       2 TeamB  Entry           TO        SH     
#2       2 TeamB  Shot            TO        SH     
#3       3 TeamA  Entry           ST        TO     
#4       3 TeamA  IneffectivePass ST        TO     
#5       4 TeamB  EffectivePass   TO        ST     
#6       4 TeamB  EffectivePass   TO        ST     
#7       4 TeamB  IneffectivePass TO        ST     
#8       5 TeamA  Entry           TO        SH     
#9       5 TeamA  Goal            TO        SH
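The answer does not show how df was created. To try the code, one option is to type the question's table in by hand, as in the sketch below. The values are copied from the question. The rows_per_chain helper and the choice of plain character columns are assumptions, so the printed result will show spaces in the names and character rather than factor columns, unlike the output above.

```r
# Number of rows in each of the seven chains, read off the question's table.
rows_per_chain <- c(5, 4, 3, 3, 3, 3, 2)

df <- data.frame(
  chainID = rep(1:7, times = rows_per_chain),
  teamID  = rep(c("Team A", "Team B", "Team A", "Team B",
                  "Team A", "Team B", "Team A"), times = rows_per_chain),
  statID  = c(
    rep("Effective Pass", 4), "Ineffective Pass",            # chain 1
    "Effective Pass", "Entry", "Effective Pass", "Shot",     # chain 2
    "Effective Pass", "Entry", "Ineffective Pass",           # chain 3
    "Effective Pass", "Effective Pass", "Ineffective Pass",  # chain 4
    "Effective Pass", "Entry", "Goal",                       # chain 5
    "Effective Pass", "Effective Pass", "Ineffective Pass",  # chain 6
    "Effective Pass", "Ineffective Pass"                     # chain 7
  ),
  startType = rep(c("TO", "TO", "ST", "TO", "TO", "CB", "TO"),
                  times = rows_per_chain),
  endType   = rep(c("TO", "SH", "TO", "ST", "SH", "TO", "ST"),
                  times = rows_per_chain),
  stringsAsFactors = FALSE
)
```

With df defined this way, the pipeline above should keep chains 2 to 5 only, since chains 1, 6 and 7 contain no "Entry" and do not follow an unfinished entry chain.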

Collapsing a data frame using multiple arguments from some variables
