I have the following sample dataset:
df <- tibble(
"PLAYER" = c("Corey Kluber", "Clayton Kershaw", "Max Scherzer", "Chris Sale",
"Corey Kluber", "Jake Arrieta", "Jose Urena", "Yu Darvish"),
"YEAR" = c(2016, 2016, 2016, 2016, 2017, 2017, 2017, 2017),
"WHIP" = c(1.24, 1.50, 1.70, 1.35, 1.42, 1.33, 1.61, 1.10)
)
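The real dataset runs from 2000 to 2017. How can I use filter() (or dplyr in general) to pull out all players who were active in more than one season? For example, in the sample above, Corey Kluber appears in both 2016 and 2017. How can I retrieve his rows with the dplyr package? I imagine it would be something like this: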
df %>%
select(PLAYER, YEAR, WHIP) %>% #MY SET HAS MORE VARIABLES THAN THE SAMPLE SHOWS
filter(PLAYER %in% YEAR == c(2016,2017))
This returns only <0 rows> (or 0-length row.names), when I want Corey Kluber to appear twice. Thanks.
Answer 0: (score: 2)
You can use dplyr::n_distinct to count how many different seasons (distinct YEAR values) each player appears in. Group by PLAYER, then filter on the condition n_distinct(YEAR) > 1:
library(tidyverse)
df %>%
  group_by(PLAYER) %>%
  filter(n_distinct(YEAR) > 1) # keep players who appear in more than one season
# # A tibble: 2 x 3
# # Groups: PLAYER [1]
# PLAYER YEAR WHIP
# <chr> <dbl> <dbl>
# 1 Corey Kluber 2016 1.24
# 2 Corey Kluber 2017 1.42
#
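To see what the filter condition is evaluating, it can help to look at the per-player season count directly. The following is a minimal sketch on the sample data from the question; the column name seasons and the object name season_counts are made up for illustration.

```r
library(dplyr)

# Sample data from the question
df <- tibble(
  PLAYER = c("Corey Kluber", "Clayton Kershaw", "Max Scherzer", "Chris Sale",
             "Corey Kluber", "Jake Arrieta", "Jose Urena", "Yu Darvish"),
  YEAR = c(2016, 2016, 2016, 2016, 2017, 2017, 2017, 2017),
  WHIP = c(1.24, 1.50, 1.70, 1.35, 1.42, 1.33, 1.61, 1.10)
)

# One row per player with the number of distinct seasons.
# `seasons` is a hypothetical column name chosen for this sketch.
season_counts <- df %>%
  group_by(PLAYER) %>%
  summarise(seasons = n_distinct(YEAR)) %>%
  arrange(desc(seasons))

season_counts
```

Only players whose count exceeds 1 survive the filter above; in this sample that is Corey Kluber alone.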
If the OP only wants the names of those players, then:
df %>% group_by(PLAYER) %>%
filter(n_distinct(YEAR) > 1) %>%
select(PLAYER) %>%
distinct()
# # A tibble: 1 x 1
# # Groups: PLAYER [1]
# PLAYER
# <chr>
# 1 Corey Kluber
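As for why the attempt in the question returns no rows: in R, %in% binds more tightly than ==, so the condition is read as (PLAYER %in% YEAR) == c(2016, 2017). The sketch below breaks that into steps on the sample data; step1, step2 and attempt are names made up for this illustration.

```r
library(dplyr)

# Sample data from the question
df <- tibble(
  PLAYER = c("Corey Kluber", "Clayton Kershaw", "Max Scherzer", "Chris Sale",
             "Corey Kluber", "Jake Arrieta", "Jose Urena", "Yu Darvish"),
  YEAR = c(2016, 2016, 2016, 2016, 2017, 2017, 2017, 2017),
  WHIP = c(1.24, 1.50, 1.70, 1.35, 1.42, 1.33, 1.61, 1.10)
)

# Step 1: is each player name found among the years? Never, so all FALSE.
step1 <- df$PLAYER %in% df$YEAR

# Step 2: FALSE (treated as 0) compared with the recycled years is also all FALSE.
step2 <- step1 == c(2016, 2017)

any(step1)
any(step2)

# The original condition therefore keeps no rows at all.
attempt <- df %>% filter(PLAYER %in% YEAR == c(2016, 2017))
nrow(attempt)
```

Because the condition never looks at which years belong to which player, grouping by PLAYER and counting distinct YEAR values is the more direct route.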