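For each group (individual_id) and each week_id, I want to count how many times that person appeared in each city during the previous X weeks.
I have tried dplyr, without success. I have also tried a loop, but it takes forever on the dataset I am working with (about 250,000 observations across 20 cities and more than 1,000 individuals), especially since I want to count appearances over the previous two years (i.e. X = 104 weeks).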
theDates = as.Date(c('07/05/2017', '07/05/2017', '07/05/2017', '14/05/2017', '14/05/2017',
                     '21/05/2017', '21/05/2017', '21/05/2017', '28/05/2017', '04/06/2017',
                     '04/06/2017', '04/06/2017', '11/06/2017', '18/06/2017', '18/06/2017'),
                   format = '%d/%m/%Y')
someData = data.frame(individual_id = c(1,2,3,2,3,1,2,3,3,1,2,3,3,2,3),
                      week_end_date = theDates,
                      city = c('Chicago', 'Chicago', 'Chicago', 'Washington', 'Washington',
                               'Chicago', 'Chicago', 'Chicago', 'Washington', 'Washington',
                               'Washington', 'Washington', 'Chicago', 'Washington', 'Washington'))
someData$nChicagoAppearancesInLastXweeks = NA
someData$nWashingtonAppearancesInLastXweeks = NA
X = 4 # this is the number of weeks for the window length
someData$start_of_period_date = someData$week_end_date - 7*X # this is the start of the range of dates to count appearances over
for (i in seq_len(nrow(someData))) {
  # every date in the X-week window that ends the day before this row's week_end_date
  WEEK_IDS = seq(someData$start_of_period_date[i], someData$week_end_date[i] - 1, by = 'days')
  INDIVIDUAL_ID = someData$individual_id[i]
  # count this individual's rows per city whose date falls inside the window
  someData$nChicagoAppearancesInLastXweeks[i] = with(someData, sum(city == 'Chicago' & individual_id == INDIVIDUAL_ID & week_end_date %in% WEEK_IDS))
  someData$nWashingtonAppearancesInLastXweeks[i] = with(someData, sum(city == 'Washington' & individual_id == INDIVIDUAL_ID & week_end_date %in% WEEK_IDS))
}
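The expected output is two new columns giving the number of times each individual_id appeared in each city over the previous X weeks. The loop above does this, but it is clearly not the best approach.
Answer 0 (score: 1)
Perform a left self-join for each added column (one column per city):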
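library(sqldf)
X <- 4
# $lev and $X are substituted into the query by the fn$ prefix below
sql <- "select sum(not b.city is null)
  from someData a
  left join someData b on
    b.city == '$lev' and
    a.[individual_id] = b.[individual_id] and
    b.[week_end_date] between a.[week_end_date] - 7 * $X and a.[week_end_date] - 1
  group by a.rowid"
# unique() rather than levels(): since R 4.0 data.frame() keeps city as character,
# so levels(someData$city) would be NULL and the loop would never run
for (lev in sort(unique(as.character(someData$city)))) someData[lev] <- fn$sqldf(sql)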
giving:
> someData
   individual_id week_end_date       city Chicago Washington
1              1    2017-05-07    Chicago       0          0
2              2    2017-05-07    Chicago       0          0
3              3    2017-05-07    Chicago       0          0
4              2    2017-05-14 Washington       1          0
5              3    2017-05-14 Washington       1          0
6              1    2017-05-21    Chicago       1          0
7              2    2017-05-21    Chicago       1          1
8              3    2017-05-21    Chicago       1          1
9              3    2017-05-28 Washington       2          1
10             1    2017-06-04 Washington       2          0
11             2    2017-06-04 Washington       2          1
12             3    2017-06-04 Washington       2          2
13             3    2017-06-11    Chicago       1          3
14             2    2017-06-18 Washington       1          1
15             3    2017-06-18 Washington       2          2
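If you would rather avoid the sqldf dependency, the same window count can be sketched in base R. This is not part of the answer above, just a minimal illustration under a few assumptions: the helper name count_prior_appearances is made up, the window is taken to be the same as in the SQL (from week_end_date - 7 * X up to week_end_date - 1), and each individual is assumed to have few enough rows that a per-individual date comparison matrix fits comfortably in memory.

```r
# Sample data from the question
theDates <- as.Date(c('07/05/2017', '07/05/2017', '07/05/2017', '14/05/2017', '14/05/2017',
                      '21/05/2017', '21/05/2017', '21/05/2017', '28/05/2017', '04/06/2017',
                      '04/06/2017', '04/06/2017', '11/06/2017', '18/06/2017', '18/06/2017'),
                    format = '%d/%m/%Y')
someData <- data.frame(individual_id = c(1,2,3,2,3,1,2,3,3,1,2,3,3,2,3),
                       week_end_date = theDates,
                       city = c('Chicago', 'Chicago', 'Chicago', 'Washington', 'Washington',
                                'Chicago', 'Chicago', 'Chicago', 'Washington', 'Washington',
                                'Washington', 'Washington', 'Chicago', 'Washington', 'Washington'))

# Hypothetical helper (name and signature are assumptions, not from the post):
# returns a matrix with one row per row of df and one column per city
count_prior_appearances <- function(df, X) {
  city <- as.character(df$city)
  cities <- sort(unique(city))
  out <- matrix(0, nrow = nrow(df), ncol = length(cities),
                dimnames = list(NULL, cities))
  # handle one individual at a time so each comparison matrix stays small
  for (idx in split(seq_len(nrow(df)), df$individual_id)) {
    d <- as.numeric(df$week_end_date[idx])
    # lag_days[i, j] = number of days by which row j precedes row i
    lag_days <- outer(d, d, "-")
    in_window <- lag_days >= 1 & lag_days <= 7 * X
    for (cty in cities) {
      # for each row i, count the in-window rows j that are in this city
      out[idx, cty] <- rowSums(in_window[, city[idx] == cty, drop = FALSE])
    }
  }
  out
}

counts <- count_prior_appearances(someData, X = 4)
result <- cbind(someData, counts)
```

On the sample data this should reproduce the Chicago and Washington columns shown above. The work grows with the square of the number of rows per individual rather than with the square of the whole table, which is what makes it more tolerable than the row-by-row loop; I have not timed it on data of the size described in the question.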