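Dividing the count of specific factor levels in a data frame by the number of occurrences in the data frame (R)

Time: 2017-05-22 00:31:04

Tags: r dplyr

I am working with MLB Statcast data in RStudio and trying to calculate each pitcher's swinging-strike rate (swinging strikes divided by the total number of pitches the pitcher threw). Here is a sample data frame: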

pitcher_name <- c('AJ Griffin','AJ Griffin','AJ Griffin','AJ Griffin','AJ Griffin',
                  'AJ Griffin','Adam Conley','Adam Conley','Adam Conley','Adam Conley',
                  'Adam Conley','Adam Conley')

description <- c('foul','swinging_strike','swinging_strike','swinging_strike_blocked',
                 'ball','hit_into_play','swinging_strike','swinging_strike',
                 'swinging_strike','swinging_strike_blocked','swinging_strike_blocked','ball')

pitch_analysis.data <- data.frame(pitcher_name, description)

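The end goal is to count each pitcher's swinging strikes (swinging_strike and swinging_strike_blocked) and then divide that number by the total number of pitches each pitcher threw. For this example, the final answer should be 50% for AJ Griffin (3 swinging strikes out of 6 pitches) and 83% for Adam Conley (5 swinging strikes out of 6 pitches). I tried the following command with the dplyr package: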

P <- pitch_analysis.data %>% group_by(pitcher_name, description) %>% count(description)

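This gives me a count for each description, but I don't know how to use dplyr for the last step: grouping the two types of swinging strikes together and then dividing by the total number of pitches for each pitcher. Any advice would be greatly appreciated, thanks!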

3 Answers:

Answer 0 (score: 1)

Using the dplyr and stringr packages, you can do the following:

library(dplyr)
library(stringr)
P <- pitch_analysis.data %>%
  group_by(pitcher_name) %>%
  summarise(r = sum(str_detect(description, "swinging")) / n())

which returns:

pitcher_name         r
        <fctr>     <dbl>
1  Adam Conley 0.8333333
2   AJ Griffin 0.5000000

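We use str_detect to flag the rows whose description contains the word "swinging", and sum to count those rows (each TRUE counts as 1). The total number of rows in each group is given by n().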
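As a minimal sketch of the same idea without stringr: the substring match can be replaced by an exact match against the two description values, which avoids picking up any other description that happens to contain "swinging". The helper vector swinging_types and the result name rates are illustrative names, not part of the original code, and the sample data is rebuilt here so the snippet runs on its own.

```r
library(dplyr)

# Sample data from the question, rebuilt so this snippet is self-contained
pitch_analysis.data <- data.frame(
  pitcher_name = rep(c("AJ Griffin", "Adam Conley"), each = 6),
  description = c("foul", "swinging_strike", "swinging_strike",
                  "swinging_strike_blocked", "ball", "hit_into_play",
                  "swinging_strike", "swinging_strike", "swinging_strike",
                  "swinging_strike_blocked", "swinging_strike_blocked", "ball"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the description values treated as swinging strikes
swinging_types <- c("swinging_strike", "swinging_strike_blocked")

rates <- pitch_analysis.data %>%
  group_by(pitcher_name) %>%
  summarise(
    swinging = sum(description %in% swinging_types),  # each TRUE counts as 1
    pitches  = n(),                                   # rows in the group
    r        = mean(description %in% swinging_types)  # same as swinging / pitches
  )
```

Taking the mean of a logical vector gives the share of TRUE values directly, so the r column should agree with swinging / pitches for every pitcher.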

Answer 1 (score: 0)

How about this way, using only dplyr?

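library(dplyr)

# tibble() replaces the deprecated data_frame(); it keeps description as a
# character column, which the ifelse() below relies on
pitch_analysis.data <- tibble(pitcher_name, description)
pitch_analysis.data %>%
  mutate(simplified_description = ifelse(description == "swinging_strike_blocked",
                                         "swinging_strike", description)) %>%
  group_by(pitcher_name, simplified_description) %>%
  count(simplified_description)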

Source: local data frame [6 x 3]
Groups: pitcher_name [?]

  pitcher_name simplified_description     n
         <chr>                  <chr> <int>
1  Adam Conley                   ball     1
2  Adam Conley        swinging_strike     5
3   AJ Griffin                   ball     1
4   AJ Griffin                   foul     1
5   AJ Griffin          hit_into_play     1
6   AJ Griffin        swinging_strike     3
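The counts above stop one step short of the rate the question asks for. One possible way to finish, sketched here under the assumption that each count should be divided by the pitcher's total number of pitches: regroup by pitcher, divide n by the group total, and keep the swinging_strike rows. The names swing_rate and rate are illustrative.

```r
library(dplyr)

# Sample data from the question as a tibble (description stays character)
pitch_analysis.data <- tibble(
  pitcher_name = rep(c("AJ Griffin", "Adam Conley"), each = 6),
  description = c("foul", "swinging_strike", "swinging_strike",
                  "swinging_strike_blocked", "ball", "hit_into_play",
                  "swinging_strike", "swinging_strike", "swinging_strike",
                  "swinging_strike_blocked", "swinging_strike_blocked", "ball")
)

swing_rate <- pitch_analysis.data %>%
  # Fold the blocked variant into the plain swinging strike
  mutate(simplified_description = ifelse(description == "swinging_strike_blocked",
                                         "swinging_strike", description)) %>%
  # One row per pitcher and simplified description, with the count in n
  count(pitcher_name, simplified_description) %>%
  group_by(pitcher_name) %>%
  mutate(rate = n / sum(n)) %>%   # share of that pitcher's pitches
  ungroup() %>%
  filter(simplified_description == "swinging_strike")
```

Note that a pitcher with no swinging strikes at all would drop out of swing_rate after the filter, so this sketch suits data where every pitcher has at least one.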

Answer 2 (score: 0)

Here is an option using data.table:

library(data.table)
setDT(pitch_analysis.data)[, .(r = sum(grepl('swinging', description))/.N), pitcher_name]
#   pitcher_name         r
#1:   AJ Griffin 0.5000000
#2:  Adam Conley 0.8333333

In base R, using rowsum:

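# +(...) turns the logical vector into 0/1 so rowsum() can add it up per pitcher;
# factor() keeps tabulate() working when pitcher_name is a character column
with(pitch_analysis.data, rowsum(+(grepl('swinging', description)), 
         pitcher_name)/tabulate(factor(pitcher_name)))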
#                 [,1]
#Adam Conley 0.8333333
#AJ Griffin  0.5000000

Or using table/prop.table:

prop.table(table(pitch_analysis.data[[1]], grepl('swinging', 
            pitch_analysis.data$description)), 1)[,2]
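
To make the table/prop.table one-liner easier to follow, here is a sketch that splits it into its intermediate steps. The names tab, props and r are illustrative, and the sample data is rebuilt so the snippet runs on its own. It selects the "TRUE" column by name rather than by position, which is an adjustment to the original code.

```r
# Sample data from the question, rebuilt so this snippet is self-contained
pitch_analysis.data <- data.frame(
  pitcher_name = rep(c("AJ Griffin", "Adam Conley"), each = 6),
  description = c("foul", "swinging_strike", "swinging_strike",
                  "swinging_strike_blocked", "ball", "hit_into_play",
                  "swinging_strike", "swinging_strike", "swinging_strike",
                  "swinging_strike_blocked", "swinging_strike_blocked", "ball"),
  stringsAsFactors = FALSE
)

# Step 1: cross-tabulate pitcher against "is this a swinging strike?"
# Rows are pitchers; columns are named "FALSE" and "TRUE"
tab <- table(pitch_analysis.data$pitcher_name,
             grepl("swinging", pitch_analysis.data$description))

# Step 2: margin = 1 converts each row to proportions that sum to 1
props <- prop.table(tab, 1)

# Step 3: keep the share of swinging strikes for each pitcher
r <- props[, "TRUE"]
```

Selecting by name fails loudly if no row matched at all, whereas [, 2] would silently depend on both columns being present.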