我正在寻找一些帮助,以帮助您理解使用dplyr的管道系统和功能汇总。我觉得我的编码有点冗长,可以简化。所以这里有两个问题,因为我知道我缺少一些概念,但是我不确定在哪里缺乏知识。我在底部包含了完整的代码。预先感谢,因为这个问题要大一些。
1a。根据下面的示例数据并使用dplyr,有没有一种方法可以在不使用中间表的情况下计算每个团队的比赛(日期)?
1b。我已经包含了计算n_games无效的原始方法。为什么?
set.seed(123)
shot_df_ex <- tibble(Team_Name = sample(LETTERS[1:5],250, replace = TRUE),
Date = sample(as.Date(c("2019-08-01",
"2019-09-01",
"2018-08-01",
"2018-09-01",
"2017-08-01",
"2017-09-01")),
size = 250, replace = TRUE),
Type = sample(c("shot","goal"), size = 250,
replace = TRUE, prob = c(0.9,0.1))
)
# count shots per team per game(date)
n_shots_per_game <- shot_df_ex %>%
count(Team_Name,Date)
n_shots_per_game
# count games (dates) per team [ISSUES!!!]
# is there a way to do this piping from the shot_df_ex tibble instead of
# using an intermediate tibble?
# count number of games using the tibble created above [DOES NOT WORK--WHY?]
n_games <- n_shots_per_game %>%
count(Team_Name)
n_games #what is this counting? It should be 6 for each.
# this works, but isn't count() just a quicker way to run
# group_by() %>% summarise()?
n_games <- n_shots_per_game %>%
group_by(Team_Name) %>%
summarise(N_Games=n())
n_games
# load librarys ------------------------------------------------
library(tidyverse)
# build sample shot data ---------------------------------------
set.seed(123)
shot_df_ex <- tibble(Team_Name = sample(LETTERS[1:5],250, replace = TRUE),
Date = sample(as.Date(c("2019-08-01",
"2019-09-01",
"2018-08-01",
"2018-09-01",
"2017-08-01",
"2017-09-01")),
size = 250, replace = TRUE),
Type = sample(c("shot","goal"), size = 250,
replace = TRUE, prob = c(0.9,0.1))
)
# calculate data ----------------------------------------------
# since every row is a shot, the following function counts shots for ea. team
n_shots <- shot_df_ex %>%
count(Team_Name) %>%
rename(N_Shots = n)
n_shots
# do the same for goals for each team
n_goals <- shot_df_ex %>%
filter(Type == "goal") %>%
count(Team_Name,sort = T) %>%
rename(N_Goals = n) %>%
arrange(Team_Name)
n_goals
# count shots per team per game(date)
n_shots_per_game <- shot_df_ex %>%
count(Team_Name,Date)
n_shots_per_game
# count games (dates) per team [ISSUES!!!]
# is there a way to do this piping from the shot_df_ex tibble instead of
# using an intermediate tibble?
# count number of games using the tibble created above [DOES NOT WORK]
n_games <- n_shots_per_game %>%
count(Team_Name)
n_games #what is this counting? It should be 6 for each.
# this works, but isn't count() just a quicker way to run
# group_by() %>% summarise()?
n_games <- n_shots_per_game %>%
group_by(Team_Name) %>%
summarise(N_Games=n())
n_games
# combine data ------------------------------------------------
# combine columns and add average shots per game
shot_table_ex <- n_games %>%
left_join(n_shots) %>%
left_join(n_goals)
# final table with final average calculations
shot_table_ex <- shot_table_ex %>%
mutate(Shots_per_Game = round(N_Shots / N_Games, 1),
Goals_per_Game = round(N_Goals / N_Games, 1)) %>%
arrange(Team_Name)
shot_table_ex
答案 0 :(得分:1)
对于1a,您可以直接从tibble()函数直接传递到count()。即。
tibble(Team_Name = sample(LETTERS[1:5],250, replace = TRUE),
Date = sample(as.Date(c("2019-08-01",
"2019-09-01",
"2018-08-01",
"2018-09-01",
"2017-08-01",
"2017-09-01")),
size = 250, replace = TRUE),
Type = sample(c("shot","goal"), size = 250,
replace = TRUE, prob = c(0.9,0.1))) %>%
count(Team_Name,Date)
在1b中,count()使用您的列n
(即射门次数)作为权重变量,因此将每个团队的总射门次数而不是行数相加。它会显示一条消息告诉您:
Using `n` as weighting variable i Quiet this message with `wt = n` or count rows with `wt = 1`
使用count(Team_Name, wt=n())
将提供您想要的行为。
修改:第2部分
shot_table_ex <- tibble(Team_Name = sample(LETTERS[1:5],250, replace = TRUE),
Date = sample(as.Date(c("2019-08-01",
"2019-09-01",
"2018-08-01",
"2018-09-01",
"2017-08-01",
"2017-09-01")),
size = 250, replace = TRUE),
Type = sample(c("shot","goal"), size = 250,
replace = TRUE, prob = c(0.9,0.1))) %>%
group_by(Team_Name) %>%
summarise(n_shots = n(),
n_goals = sum(Type == "goal"),
n_games = n_distinct(Date)) %>%
mutate(Shots_per_Game = round(n_shots / n_games, 1),
Goals_per_Game = round(n_goals / n_games, 1))
答案 1 :(得分:1)
1a。根据下面的示例数据并使用dplyr,有没有一种方法可以在不使用中间表的情况下计算每个团队的比赛(日期)?
这就是我要做的:
shot_df_ex %>%
distinct(Team_Name, Date) %>% #Keeps only the cols given and one of each combo
count(Team_Name)
您还可以使用唯一的:
shot_df_ex %>%
group_by(Team_Name) %>%
summarize(N_Games = length(unique(Date))
1b。我已经包含了计算n_games的原始方法 工作。为什么?
您的代码对我有用。您是否保存了中间表?它正在计算每个团队预期的6人。
- 以下是我创建摘要表的过程。我知道管道是为了减少某些中间产物的产生 变量/表。我在哪里可以结合以下步骤来创建 决赛桌的中间步骤最少?
shot_df_ex %>%
group_by(Team_Name) %>%
summarize(
N_Games = length(unique(Date)),
N_Shots = sum(Type == "shot"),
N_Goals = sum(Type == "goal")
) %>%
mutate(Shots_per_Game = round(N_Shots / N_Games, 1),
Goals_per_Game = round(N_Goals / N_Games, 1))
只要您不需要更改分组,就可以一次使用多个汇总步骤。我们在这里(在sum
调用中利用True为1并将False为0的解释。length
当然将为我们提供unique
产生的向量的长度
此(计数)有效,但是count()并不是运行group_by()更快的方法%>%summarise()吗?
count
只是group_by(col) %>% tally()
的组合,而tally本质上是summarize(x=n())
,所以是的。 :)