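How do I combine multiple variables and create a new dataset?

Time: 2020-04-16 06:52:27

Tags: r list select dplyr

https://www.kaggle.com/nowke9/ipldata (contains the IPL data)

This is an exploratory study of the IPL dataset (linked above). After merging the two files on "id" and "match_id", I created four more variables: total_extras, total_runs_scored, total_fours_hit and total_sixes_hit. I now want to combine these newly created variables into a single data frame. When I assign them to one variable, batsman_aggregate, and select only the columns I need, I get an error message.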

    library(tidyverse)
    deliveries_tbl <- read.csv("deliveries_edit.csv")
    matches_tbl <- read.csv("matches.csv")

    combined_matches_deliveries_tbl <- deliveries_tbl %>%
        left_join(matches_tbl, by = c("match_id" = "id"))

    # Add team score and team extra columns for each match, each inning.
    total_score_extras_combined <- combined_matches_deliveries_tbl %>%
        group_by(id, inning, date, batting_team, bowling_team, winner) %>%
        mutate(total_score = sum(total_runs, na.rm = TRUE)) %>%
        mutate(total_extras = sum(extra_runs, na.rm = TRUE)) %>%
        group_by(total_score, total_extras, id, inning, date, batting_team, bowling_team, winner) %>%
        select(id, inning, total_score, total_extras, date, batting_team, bowling_team, winner) %>%
        distinct(total_score, total_extras) %>%
        glimpse() %>%
        ungroup()

    # Batsman aggregate (runs, balls, fours, sixes, strike rate)
    # Batsman score in each match
    batsman_score_in_a_match <- combined_matches_deliveries_tbl %>%
        group_by(id, inning, batting_team, batsman) %>%
        mutate(total_batsman_runs = sum(batsman_runs, na.rm = TRUE)) %>%
        distinct(total_batsman_runs) %>%
        glimpse() %>%
        ungroup()

    # Number of deliveries played.
    balls_faced <- combined_matches_deliveries_tbl %>%
        filter(wide_runs == 0) %>%
        group_by(id, inning, batsman) %>%
        summarise(deliveries_played = n()) %>%
        ungroup()

    # Number of fours and sixes hit by a batsman in each match.
    fours_hit <- combined_matches_deliveries_tbl %>%
        filter(batsman_runs == 4) %>%
        group_by(id, inning, batsman) %>%
        summarise(fours_hit = n()) %>%
        glimpse() %>%
        ungroup()

    sixes_hit <- combined_matches_deliveries_tbl %>%
        filter(batsman_runs == 6) %>%
        group_by(id, inning, batsman) %>%
        summarise(sixes_hit = n()) %>%
        glimpse() %>%
        ungroup()

    batsman_aggregate <- c(batsman_score_in_a_match, balls_faced, fours_hit, sixes_hit) %>%
        select(id, inning, batsman, total_batsman_runs, deliveries_played, fours_hit, sixes_hit)

The error message is:

Error: `select()` doesn't handle lists.

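The desired output is a dataset built from the newly created variables.

1 answer:

Answer 0 (score: 1)

You have to join these four tables rather than combine them with c(). Calling c() on data frames concatenates their columns into a plain list, which is why select() reports that it does not handle lists.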
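To make the difference concrete, here is a minimal sketch on two toy tables. The table names `runs` and `fours` and all values in them are invented for illustration; they only stand in for the summary tables from the question.

```r
library(dplyr)

# Toy stand-ins for two of the per-batsman summary tables (values invented).
runs  <- tibble(id = c(1, 1), inning = c(1, 1),
                batsman = c("A", "B"), total_batsman_runs = c(14, 40))
fours <- tibble(id = 1, inning = 1, batsman = "A", fours_hit = 2)

# c() concatenates the columns of both tables into one plain list,
# so the result is no longer a data frame that select() can work on.
combined_list <- c(runs, fours)
class(combined_list)
is.data.frame(combined_list)

# A join matches rows on the shared key columns and returns a tibble.
joined <- left_join(runs, fours, by = c("id", "inning", "batsman"))
is.data.frame(joined)
joined
```

In this sketch batsman "B" has no row in `fours`, so the joined table should hold NA in `fours_hit` for that row.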

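The join type is left_join so that all batsmen are included in the output. Those who faced no balls or hit no boundaries will have NA, but you can easily replace these with 0.

I have omitted by because dplyr will assume you want c("id", "inning", "batsman"), which are the only three columns common to all four datasets.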

batsman_aggregate <- left_join(batsman_score_in_a_match, balls_faced) %>%
  left_join(fours_hit) %>%
  left_join(sixes_hit) %>%
  select(id, inning, batsman, total_batsman_runs, deliveries_played, fours_hit, sixes_hit) %>%
  replace(is.na(.), 0)

# A tibble: 11,335 x 7
      id inning batsman       total_batsman_runs deliveries_played fours_hit sixes_hit
   <int>  <int> <fct>                      <int>             <dbl>     <dbl>     <dbl>
 1     1      1 DA Warner                     14                 8         2         1
 2     1      1 S Dhawan                      40                31         5         0
 3     1      1 MC Henriques                  52                37         3         2
 4     1      1 Yuvraj Singh                  62                27         7         3
 5     1      1 DJ Hooda                      16                12         0         1
 6     1      1 BCJ Cutting                   16                 6         0         2
 7     1      2 CH Gayle                      32                21         2         3
 8     1      2 Mandeep Singh                 24                16         5         0
 9     1      2 TM Head                       30                22         3         0
10     1      2 KM Jadhav                     31                16         4         1
# ... with 11,325 more rows
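If you would rather not rely on the implicit join columns, one possible variation spells out the keys and fills only the count columns. This is a sketch on invented toy tables, not the code that produced the output above; the helper vector `keys` is a hypothetical name, and `replace_na()` comes from tidyr.

```r
library(dplyr)
library(tidyr)

# Toy summary tables (values invented for illustration).
runs  <- tibble(id = c(1, 1), inning = c(1, 1), batsman = c("A", "B"),
                total_batsman_runs = c(14, 40))
fours <- tibble(id = 1, inning = 1, batsman = "A", fours_hit = 2L)
sixes <- tibble(id = 1, inning = 1, batsman = "B", sixes_hit = 1L)

# Hypothetical helper: the key columns shared by all tables.
keys <- c("id", "inning", "batsman")

aggregate_tbl <- runs %>%
  left_join(fours, by = keys) %>%
  left_join(sixes, by = keys) %>%
  # Only the count columns can be NA after the joins; fill those with 0.
  replace_na(list(fours_hit = 0L, sixes_hit = 0L))

aggregate_tbl
```

Naming the columns in `replace_na()` leaves the key columns untouched, which may be preferable when `batsman` is a factor.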

There are also 2 batsmen who did not face any deliveries:

batsman_aggregate %>% filter(deliveries_played==0)
# A tibble: 2 x 7
     id inning batsman        total_batsman_runs deliveries_played fours_hit sixes_hit
  <int>  <int> <fct>                       <int>             <dbl>     <dbl>     <dbl>
1   482      2 MK Pandey                       0                 0         0         0
2  7907      1 MJ McClenaghan                  2                 0         0         0

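One of them apparently scored 2 runs! So I think the batsman_runs column contains some errors. The game is here, and it clearly says that on the second-to-last delivery of the first innings, 2 wides were scored, not runs by the batsman.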
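One way to look for other records like this is to flag deliveries where both wide_runs and batsman_runs are positive, since wides count as extras rather than runs for the batsman. The sketch below uses an invented toy table; `deliveries` and `suspect_deliveries` are hypothetical names, and only the column names wide_runs and batsman_runs come from the question.

```r
library(dplyr)

# Toy ball-by-ball data (invented); the last row mimics the suspect record:
# a wide that is also credited to the batsman.
deliveries <- tibble(
  id           = c(1, 1, 1),
  inning       = c(1, 1, 1),
  batsman      = c("A", "A", "B"),
  wide_runs    = c(0, 1, 2),
  batsman_runs = c(4, 0, 2)
)

# A wide should not add to the batsman's own score, so any row with both
# columns positive is worth checking against the scorecard.
suspect_deliveries <- deliveries %>%
  filter(wide_runs > 0, batsman_runs > 0)

suspect_deliveries
```

Running the same filter on combined_matches_deliveries_tbl would presumably list the McClenaghan delivery along with any similar rows.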