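I want to create a rolling function that conditionally counts occurrences across two columns in the preceding rows.
As an example, I have a dataset like the one below.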
# Generate data
set.seed(123)
test <- data.frame(
  Round = rep(1:5, times = 3),
  Team = rep(c("Team 1", "Team 2", "Team 3"), each = 5),
  Venue = sample(sample(c("Venue A", "Venue B"), 15, replace = TRUE))
)
Round Team Venue
1 1 Team 1 Venue B
2 2 Team 1 Venue B
3 3 Team 1 Venue A
4 4 Team 1 Venue A
5 5 Team 1 Venue B
6 1 Team 2 Venue B
7 2 Team 2 Venue B
8 3 Team 2 Venue A
9 4 Team 2 Venue A
10 5 Team 2 Venue A
11 1 Team 3 Venue B
12 2 Team 3 Venue A
13 3 Team 3 Venue B
14 4 Team 3 Venue B
15 5 Team 3 Venue B
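I want a new column that shows, for each row, how many times that row's team has played at that row's venue over the last 3 rounds (the current round is counted as well, as in the loop below).
I can do this quite easily with a for loop.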
window <- 3
for (i in seq_len(nrow(test))) {
  # Rows to search: the current row plus up to `window` rows before it
  # (if i is less than window, start at 1)
  index <- max(i - window, 1):i
  # Count rows in that range that match both the current team and venue
  test$VenueCount[i] <- sum(test$Team[i] == test$Team[index] & test$Venue[i] == test$Venue[index])
}
Round Team Venue VenueCount
1 1 Team 1 Venue B 1
2 2 Team 1 Venue B 2
3 3 Team 1 Venue A 1
4 4 Team 1 Venue A 2
5 5 Team 1 Venue B 2
6 1 Team 2 Venue B 1
7 2 Team 2 Venue B 2
8 3 Team 2 Venue A 1
9 4 Team 2 Venue A 2
10 5 Team 2 Venue A 3
11 1 Team 3 Venue B 1
12 2 Team 3 Venue A 1
13 3 Team 3 Venue B 2
14 4 Team 3 Venue B 3
15 5 Team 3 Venue B 3
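However, I'd like to avoid the for loop (mainly because my actual dataset is relatively large, at roughly 30k rows). I think this should be possible with one of zoo, dplyr, purrr, or apply, but I haven't been able to work it out.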
Answer 0 (score: 2)
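Venturing a data.table solution here, in case you are not looking only for a dplyr solution.
You can roll with a window of size 4 (the current round plus the three before it) and count the occurrences that match the latest row.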
library(data.table)
library(zoo)
setDT(test)
winsize <- 4
test[, .(Round,
         Venue,
         VenueCount = rollapplyr(c(rep("", winsize - 1), Venue), winsize,
                                 function(x) sum(x == last(x)))),
     by = .(Team)]
Result:
# Team Round Venue VenueCount
# 1: Team 1 1 Venue B 1
# 2: Team 1 2 Venue B 2
# 3: Team 1 3 Venue A 1
# 4: Team 1 4 Venue A 2
# 5: Team 1 5 Venue B 2
# 6: Team 2 1 Venue B 1
# 7: Team 2 2 Venue B 2
# 8: Team 2 3 Venue A 1
# 9: Team 2 4 Venue A 2
# 10: Team 2 5 Venue A 3
# 11: Team 3 1 Venue B 1
# 12: Team 3 2 Venue A 1
# 13: Team 3 3 Venue B 2
# 14: Team 3 4 Venue B 3
# 15: Team 3 5 Venue B 3
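To make the padding trick easier to follow, here is a minimal base R sketch of the same idea, without zoo or data.table. The helper name count_last_match and the small games data frame are made up for illustration; they are not part of the answer above. The front of the vector is padded so that every round has a full-width window, and each window counts the entries equal to its newest one.

```r
# Hypothetical helper (not from the answer): for each position, count how many
# entries in the trailing window of width `winsize` equal the newest entry.
count_last_match <- function(x, winsize) {
  x <- as.character(x)
  # Pad the front with NA so the first rounds also get a full-width window.
  padded <- c(rep(NA_character_, winsize - 1), x)
  vapply(seq_along(x), function(i) {
    w <- padded[i:(i + winsize - 1)]
    # The padding compares as NA and is dropped by na.rm = TRUE.
    sum(w == w[winsize], na.rm = TRUE)
  }, integer(1))
}

# Made-up venues for a single team over five rounds.
venue <- c("Venue B", "Venue B", "Venue A", "Venue A", "Venue B")
single_team <- count_last_match(venue, winsize = 4)
print(single_team)
# 1 2 1 2 2

# Applying it per team on a small made-up data frame.
games <- data.frame(
  Team  = rep(c("Team 1", "Team 2"), each = 3),
  Venue = c("Venue B", "Venue B", "Venue A", "Venue A", "Venue A", "Venue A")
)
games$VenueCount <- unsplit(
  lapply(split(games$Venue, games$Team), count_last_match, winsize = 4),
  games$Team
)
print(games$VenueCount)
# 1 2 1 1 2 3
```

This is only meant to show what the window sees at each step; for 30k rows the rollapplyr version above is likely the more practical choice.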
Answer 1 (score: 2)
I actually came up with an answer using rollify from the tibbletime package together with dplyr::mutate. Posting it here, but still open to other replies!
library(dplyr)
library(tibbletime)
# Create data
set.seed(123)
test <- data.frame(
  Round = rep(1:5, times = 3),
  Team = rep(c("Team 1", "Team 2", "Team 3"), each = 5),
  Venue = sample(sample(c("Venue A", "Venue B"), 15, replace = TRUE))
)
Create a custom function with rollify.
last_n_games = 3
count_games <- rollify(function(x) sum(last(x) == x), window = last_n_games)
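Now use mutate to run the function. This returns NA for the first 2 rows of each team (i.e. last_n_games - 1). I can then use group_by and row_number to fill in the counts for those first occurrences.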
test <- test %>%
  group_by(Team) %>%
  mutate(VenueCount = count_games(Venue)) %>%
  group_by(Team, Venue) %>%
  mutate(VenueCount = ifelse(is.na(VenueCount), row_number(), VenueCount))
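This returns the following. Note that rows 5 and 14 differ from the loop output in the question, because a window of 3 here covers the current round and only the two before it, while the loop looks back three rows in addition to the current one.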
# A tibble: 15 x 4
# Groups: Team, Venue [6]
Round Team Venue VenueCount
<int> <fct> <fct> <int>
1 1 Team 1 Venue B 1
2 2 Team 1 Venue B 2
3 3 Team 1 Venue A 1
4 4 Team 1 Venue A 2
5 5 Team 1 Venue B 1
6 1 Team 2 Venue B 1
7 2 Team 2 Venue B 2
8 3 Team 2 Venue A 1
9 4 Team 2 Venue A 2
10 5 Team 2 Venue A 3
11 1 Team 3 Venue B 1
12 2 Team 3 Venue A 1
13 3 Team 3 Venue B 2
14 4 Team 3 Venue B 2
15 5 Team 3 Venue B 3
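To see why the NA fill works: in the first last_n_games - 1 rows of a team, a full window would reach back past the team's first game, so the count is simply the running number of appearances at that venue so far. A small base R sketch of that running count, using a made-up venue vector for one team (the variable names are illustrative only):

```r
# Made-up venues for a single team over five rounds.
venue <- c("Venue B", "Venue B", "Venue A", "Venue A", "Venue B")

# Running count of appearances per venue: the base R analogue of
# row_number() within a (Team, Venue) group.
running_count <- ave(seq_along(venue), venue, FUN = seq_along)
print(running_count)
# 1 2 1 2 3

# Only the first last_n_games - 1 entries are used as the fill values.
last_n_games <- 3
fill_values <- running_count[seq_len(last_n_games - 1)]
print(fill_values)
# 1 2
```

From row last_n_games onward the rolling function has a full window, so its own value is kept and the running count is ignored.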
Answer 2 (score: 0)
I like using data.table; it is fast and versatile.
The idea is to join the table to itself twice, with 2 lags, (Round + 1) and (Round + 2), so this is what I did.
> test1<-test
> test2<-test
> test<-as.data.table(test)
> test1<-as.data.table(test1)
> test2<-as.data.table(test2)
After taking the copies and converting these data.frames to data.table, shift the rounds:
> test1[,Round:=Round+1,]
> test2[,Round:=Round+2,]
With the rounds lagged, join the tables together:
> test2[test1,on=c('Round','Team')][test,on=c('Round','Team')]
Round Team Venue i.Venue i.Venue.1
1: 1 Team 1 NA NA Venue B
2: 2 Team 1 NA Venue B Venue B
3: 3 Team 1 Venue B Venue B Venue A
4: 4 Team 1 Venue B Venue A Venue A
5: 5 Team 1 Venue A Venue A Venue B
6: 1 Team 2 NA NA Venue B
7: 2 Team 2 NA Venue B Venue B
8: 3 Team 2 Venue B Venue B Venue A
9: 4 Team 2 Venue B Venue A Venue A
10: 5 Team 2 Venue A Venue A Venue A
11: 1 Team 3 NA NA Venue B
12: 2 Team 3 NA Venue B Venue A
13: 3 Team 3 Venue B Venue A Venue B
14: 4 Team 3 Venue A Venue B Venue B
15: 5 Team 3 Venue B Venue B Venue B
Since this produces a lot of NAs, we use the compareNA function from R-Cookbook.com, which ben mentioned in his answer:
compareNA <- function(v1, v2) {
  # This function returns TRUE wherever elements are the same, including NA's,
  # and FALSE everywhere else.
  same <- (v1 == v2) | (is.na(v1) & is.na(v2))
  same[is.na(same)] <- FALSE
  return(same)
}
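A quick check of how compareNA treats missing values, using made-up vectors (the function is repeated here only so the snippet runs on its own):

```r
# Same definition as above, repeated so this snippet is self-contained.
compareNA <- function(v1, v2) {
  same <- (v1 == v2) | (is.na(v1) & is.na(v2))
  same[is.na(same)] <- FALSE
  return(same)
}

# Made-up inputs covering: equal values, NA on one side only, NA on both sides.
v1 <- c("Venue A", NA, "Venue B", NA)
v2 <- c("Venue A", "Venue B", NA, NA)

res <- compareNA(v1, v2)
print(res)
# TRUE FALSE FALSE TRUE
```

A plain `v1 == v2` would return NA in the last three positions, which would then propagate into any sum built on top of it.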
We can then get our final result:
> end <- test2[test1, on = c('Round', 'Team')][test, on = c('Round', 'Team')][
+   , VenueCount := (1 + compareNA(i.Venue.1, i.Venue) + compareNA(i.Venue.1, Venue))]
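Explanation:
test2 is right-joined with test1 on Round and Team, and the result is then joined with test on Round and Team, so that we get:
i.Venue.1 is the Team's current venue,
i.Venue is the Team's previous venue,
Venue is the Team's venue from two rounds back.
With the logic (1 + compareNA(i.Venue.1, i.Venue) + compareNA(i.Venue.1, Venue)) you can count how many times the team has played at this venue over the past 3 rounds (the current round and the two before it).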
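The same lag-and-compare logic can be sketched on a single team's venues in base R, without the joins. The helper lag_vec and the venue vector are made up for illustration, and the sketch assumes the lag is shorter than the vector:

```r
# Hypothetical helper: shift a vector back by n positions, padding with NA.
# Assumes n < length(x).
lag_vec <- function(x, n) c(rep(NA, n), head(x, -n))

# Made-up venues for a single team over five rounds.
venue <- c("Venue B", "Venue B", "Venue A", "Venue A", "Venue B")

prev1 <- lag_vec(venue, 1)  # venue one round back
prev2 <- lag_vec(venue, 2)  # venue two rounds back

# 1 for the current round, plus one for each lag that matches it.
# `%in% TRUE` turns the NA comparisons of the first rounds into FALSE.
venue_count <- 1L + ((venue == prev1) %in% TRUE) + ((venue == prev2) %in% TRUE)
print(venue_count)
# 1 2 1 2 1
```

On real data this would have to be applied within each team, which is what joining on both Round and Team achieves in the data.table version.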
> end
Round Team Venue i.Venue i.Venue.1 VenueCount
1: 1 Team 1 NA NA Venue B 1
2: 2 Team 1 NA Venue B Venue B 2
3: 3 Team 1 Venue B Venue B Venue A 1
4: 4 Team 1 Venue B Venue A Venue A 2
5: 5 Team 1 Venue A Venue A Venue B 1
6: 1 Team 2 NA NA Venue B 1
7: 2 Team 2 NA Venue B Venue B 2
8: 3 Team 2 Venue B Venue B Venue A 1
9: 4 Team 2 Venue B Venue A Venue A 2
10: 5 Team 2 Venue A Venue A Venue A 3
11: 1 Team 3 NA NA Venue B 1
12: 2 Team 3 NA Venue B Venue A 1
13: 3 Team 3 Venue B Venue A Venue B 2
14: 4 Team 3 Venue A Venue B Venue B 2
15: 5 Team 3 Venue B Venue B Venue B 3
Hope this helps.