https://www.kaggle.com/nowke9/ipldata -----包含IPL数据。
这是对IPL数据集进行的探索性研究。 (上面的数据的链接)将文件“ id”和“ match_id”合并后,我又创建了四个变量,分别是total_extras,total_runs_scored,total_fours_hit和total_sixes_hit。现在,我希望将这些新创建的变量组合到一个数据框中。当我将这些变量分配给一个单独的变量batsman_aggregate并仅选择所需的列时,我收到一条错误消息。
library(tidyverse)
deliveries_tbl <- read.csv("deliveries_edit.csv")
matches_tbl <- read.csv("matches.csv")
combined_matches_deliveries_tbl <- deliveries_tbl %>%
left_join(matches_tbl, by = c("match_id" = "id"))
# Add team score and team extra columns for each match, each inning.
total_score_extras_combined <- combined_matches_deliveries_tbl%>%
group_by(id, inning, date, batting_team, bowling_team, winner)%>%
mutate(total_score = sum(total_runs, na.rm = TRUE))%>%
mutate(total_extras = sum(extra_runs, na.rm = TRUE))%>%
group_by(total_score, total_extras, id, inning, date, batting_team, bowling_team, winner)%>%
select(id, inning, total_score, total_extras, date, batting_team, bowling_team, winner)%>%
distinct(total_score, total_extras)%>%
glimpse()%>%
ungroup()
# Batsman Aggregate (Runs Balls, fours, six , Sr)
# Batsman score in each match
batsman_score_in_a_match <- combined_matches_deliveries_tbl %>%
group_by(id, inning, batting_team, batsman)%>%
mutate(total_batsman_runs = sum(batsman_runs, na.rm = TRUE))%>%
distinct(total_batsman_runs)%>%
glimpse()%>%
ungroup()
# Number of deliveries played .
balls_faced <- combined_matches_deliveries_tbl %>%
filter(wide_runs == 0)%>%
group_by(id, inning, batsman)%>%
summarise(deliveries_played = n())%>%
ungroup()
# Number of 4 and 6s by a batsman in each match.
fours_hit <- combined_matches_deliveries_tbl %>%
filter(batsman_runs == 4)%>%
group_by(id, inning, batsman)%>%
summarise(fours_hit = n())%>%
glimpse()%>%
ungroup()
sixes_hit <- combined_matches_deliveries_tbl %>%
filter(batsman_runs == 6)%>%
group_by(id, inning, batsman)%>%
summarise(sixes_hit = n())%>%
glimpse()%>%
ungroup()
batsman_aggregate <- c(batsman_score_in_a_match, balls_faced, fours_hit, sixes_hit)%>%
select(id, inning, batsman, total_batsman_runs, deliveries_played, fours_hit, sixes_hit)
错误消息显示为:-
Error: `select()` doesn't handle lists.
所需的输出是新建的变量创建的数据集。
答案 0 :(得分:1)
您必须加入这四个表,而不是使用c
进行合并。
连接类型为left_join
,因此所有击球手都包括在输出中。那些没有遇到任何球或没有碰到任何边界的人将具有NA,但是您可以轻松地将它们替换为0。
我已经忽略了by
,因为dplyr会假设您想要c("id", "inning", "batsman")
,这是所有四个数据集中仅有的3个公共列。
batsman_aggregate <- left_join(batsman_score_in_a_match, balls_faced) %>%
left_join(fours_hit) %>%
left_join(sixes_hit) %>%
select(id, inning, batsman, total_batsman_runs, deliveries_played, fours_hit, sixes_hit) %>%
replace(is.na(.), 0)
# A tibble: 11,335 x 7
id inning batsman total_batsman_runs deliveries_played fours_hit sixes_hit
<int> <int> <fct> <int> <dbl> <dbl> <dbl>
1 1 1 DA Warner 14 8 2 1
2 1 1 S Dhawan 40 31 5 0
3 1 1 MC Henriques 52 37 3 2
4 1 1 Yuvraj Singh 62 27 7 3
5 1 1 DJ Hooda 16 12 0 1
6 1 1 BCJ Cutting 16 6 0 2
7 1 2 CH Gayle 32 21 2 3
8 1 2 Mandeep Singh 24 16 5 0
9 1 2 TM Head 30 22 3 0
10 1 2 KM Jadhav 31 16 4 1
# ... with 11,325 more rows
还有2名蝙蝠侠没有面对任何任何交付:
batsman_aggregate %>% filter(deliveries_played==0)
# A tibble: 2 x 7
id inning batsman total_batsman_runs deliveries_played fours_hit sixes_hit
<int> <int> <fct> <int> <dbl> <dbl> <dbl>
1 482 2 MK Pandey 0 0 0 0
2 7907 1 MJ McClenaghan 2 0 0 0
其中一个显然获得了2分!因此,我认为batsman_runs
列有一些错误。游戏是here,并明确表示在第一局的倒数第二次交付时,得分为2个宽度,而不是击球手。