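I am trying to find the difference between each year's league leader and the league average for six different baseball variables. My goal is to find the largest difference between a player's total and the league average. For example, Babe Ruth hit 60 home runs in 1927 and the league average was 6.30 per player, so the difference is 53.7.
I created this leag_avg data frame: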
library(dplyr)

# League-wide averages for each season
leag_avg <- batting100 %>%
  group_by(yearID) %>%
  summarise(lgba_avg = round(sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE), digits = 3),
            lghr_avg = round(mean(HR, na.rm = TRUE), digits = 2),
            lgrbi_avg = round(mean(RBI, na.rm = TRUE), digits = 2),
            lgslg_avg = round(mean(slg, na.rm = TRUE), digits = 3),
            lgobp_avg = round(mean(obp, na.rm = TRUE), digits = 3),
            lgruns_avg = round(mean(R, na.rm = TRUE), digits = 2),
            soratio = round(mean(so_ratio, na.rm = TRUE), digits = 2))
This gave me every year in the data frame (1871-2015) along with the league average for each variable: 133 observations of 8 variables.
Then I found the highest home run total for each year:
bestHR <- batting100 %>%
  group_by(yearID) %>%
  summarise(highest_HR = max(HR))
Then I merged to add playerID to the data frame:
bestHR2 <- merge(bestHR, batting100[, c("yearID", "HR", "playerID")], by.x = c("yearID", "highest_HR"), by.y = c("yearID", "HR"))
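bestHR2 returns 153 observations of 3 variables. Because of ties, that is 20 more observations than in my league averages. To get down to 133 observations so I can do the calculation, I need to eliminate the ties. Does anyone know how to do this? For example, in 1886 two players tied with 11 home runs. How can I remove one of them?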
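One possible way to drop the ties, sketched here in base R on toy data: `duplicated()` flags every row after the first for a given `yearID`, so negating it keeps one leader per year. The two small data frames below are stand-ins for the real `bestHR2` and `leag_avg`; the 1886 player IDs and the 1886 league average are made up for illustration, and `bestHR_unique` and `hr_diff` are names introduced only for this example.

```r
# Toy stand-ins for the real data frames (assumed shapes, not the real data)
bestHR2 <- data.frame(
  yearID     = c(1886, 1886, 1927),
  highest_HR = c(11, 11, 60),
  playerID   = c("playerA", "playerB", "ruthba01"),  # playerA/playerB are placeholders
  stringsAsFactors = FALSE
)
leag_avg <- data.frame(
  yearID   = c(1886, 1927),
  lghr_avg = c(2.50, 6.30)  # the 1886 value is invented for this sketch
)

# Keep only the first row for each year; later tied rows are dropped
bestHR_unique <- bestHR2[!duplicated(bestHR2$yearID), ]

# Attach the league average and compute the leader-minus-average gap
hr_diff <- merge(bestHR_unique, leag_avg[, c("yearID", "lghr_avg")], by = "yearID")
hr_diff$diff <- hr_diff$highest_HR - hr_diff$lghr_avg

# Row with the largest gap
hr_diff[which.max(hr_diff$diff), ]
```

Which tied player survives depends only on row order, so this is arbitrary rather than meaningful. With dplyr, `distinct(bestHR2, yearID, .keep_all = TRUE)` should behave the same way. It may also be worth noting that the ties do not have to be removed for the calculation itself: merging `leag_avg` onto `bestHR2` by `yearID` gives every tied leader the same difference.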