I have two data frames in R.
The release data frame:
Date Product
2011-01-13 A
2011-02-15 A
2011-01-14 B
2011-02-15 B
The casedata data frame:
Date Product Numberofcases
2011-01-13 A 50
2011-01-12 A 20
2011-01-11 A 100
2011-01-10 A 120
2011-01-09 A 150
2011-01-08 A 180
2011-01-07 A 200
2011-01-06 A 220
2011-01-23 A 500
2011-01-31 A 450
2011-02-08 A 50
2011-02-09 A 1000
2011-02-10 A 1200
2011-02-11 A 1500
2011-02-12 A 1800
2011-02-13 A 2000
2011-02-14 A 2200
2011-02-15 A 5000
2011-01-31 A 4500
:::
:::
2011-01-15 B 1000
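My requirement is that, for each product release date (from the release data frame), I get the sum of Numberofcases over the week before the release date (from the casedata data frame). For example, for product A and release date 2011-01-13, it should be the sum of all cases in the preceding week (2011-01-06 through 2011-01-13), i.e. (50 + 20 + 100 + 120 + 150 + 180 + 200 + 220). The expected output: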
Releasedate Product Numberofcasesoneweekpriorrelease
2011-01-13 A 1040
2011-02-15 A 19250
2011-01-14 B ...
2011-02-15 B ...
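As a quick check of the arithmetic described above, here is a minimal R sketch for product A and release date 2011-01-13. The variable names are illustrative only, and the counts are copied from the casedata rows shown earlier. Note that a window running from 2011-01-06 through 2011-01-13 inclusive spans eight calendar days.

```r
# Release date for product A and the start of the one-week window before it
release_date <- as.Date("2011-01-13")
window_start <- release_date - 7          # 2011-01-06

# Case counts for product A dated 2011-01-06 through 2011-01-13 (from the table above)
cases <- c(50, 20, 100, 120, 150, 180, 200, 220)

# Total for the window
window_total <- sum(cases)
print(format(window_start))
print(window_total)                       # 1040
```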
I tried:
beforerelease <- sqldf("select product,release.date_release,sum(numberofcasescreated) as numberofcasesbeforerelease from release left join casedata using (product) where date_case>=weekbeforerelease and date_case<=date_release group by product,date_release")
finaldf <- merge(beforerelease,afterelease,by=c("monthyear","product"))
I am stuck: this does not give me the expected result. Can someone help me?
Answer 0: (score: 5)
Using the non-equi joins feature recently implemented in data.table, v1.9.7, this can be done simply (assuming all Date columns are of class Date):
require(data.table)
setDT(release)[, Date2 := Date-7L]
setDT(casedata)[release, on = .(Product, Date >= Date2, Date <= Date),
.(count = sum(Numberofcases)), by = .EACHI]
# Product Date Date count
# 1: A 2011-01-06 2011-01-13 1040
# 2: A 2011-02-08 2011-02-15 14750
# 3: B 2011-01-07 2011-01-14 NA
# 4: B 2011-02-08 2011-02-15 NA
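The join above assumes the Date columns are already of class Date. For readers without data.table, the following is a rough base R sketch of the same window sum with that conversion made explicit. It is not the answer's method, only an illustration under stated assumptions: the data are copied from the "used data" at the end of this page, and a release window with no matching case rows is reported as NA to mirror the output above.

```r
# Sample data copied from the "used data" section of this page
release <- data.frame(
  Date = c("2011-01-13", "2011-02-15", "2011-01-14", "2011-02-15"),
  Product = c("A", "A", "B", "B"),
  stringsAsFactors = FALSE
)
casedata <- data.frame(
  Date = c("2011-01-13", "2011-01-12", "2011-01-11", "2011-01-10", "2011-01-09",
           "2011-01-08", "2011-01-07", "2011-01-06", "2011-01-23", "2011-01-31",
           "2011-02-08", "2011-02-09", "2011-02-10", "2011-02-11", "2011-02-12",
           "2011-02-13", "2011-02-14", "2011-02-15", "2011-01-31"),
  Product = "A",
  Numberofcases = c(50L, 20L, 100L, 120L, 150L, 180L, 200L, 220L, 500L, 450L,
                    50L, 1000L, 1200L, 1500L, 1800L, 2000L, 2200L, 5000L, 4500L),
  stringsAsFactors = FALSE
)

# Convert the character dates to class Date so date arithmetic works
release$Date  <- as.Date(release$Date)
casedata$Date <- as.Date(casedata$Date)

# For each release row, sum the cases of the same product dated within
# [release date - 7 days, release date]; NA when no case rows match
release$count <- vapply(seq_len(nrow(release)), function(i) {
  in_window <- casedata$Product == release$Product[i] &
    casedata$Date >= release$Date[i] - 7 &
    casedata$Date <= release$Date[i]
  if (any(in_window)) sum(casedata$Numberofcases[in_window]) else NA_integer_
}, integer(1))

print(release)
```

This loops over release rows, so it scales worse than a keyed join on large tables; it is meant only to make the window condition easy to read.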
Answer 1: (score: 3)
Using the data.table package, you can take the following two approaches:
1) Using the foverlaps function:
library(data.table)
# convert to a 'data.table' with 'setDT()'
# and create a release window
setDT(release)[, `:=` (bdat = as.Date(Date)-7, edat = as.Date(Date))][, Date := NULL]
# convert to a 'data.table' and create a 2nd date column for use with 'foverlaps'
setDT(casedata)[, `:=` (bdat = as.Date(Date), edat = as.Date(Date))][, Date := NULL]
# set the key for use in 'foverlaps'
setkey(release, Product, bdat, edat)
setkey(casedata, Product, bdat, edat)
# do an overlap join ('foverlaps') and summarise
foverlaps(casedata, release, type = 'within', nomatch = 0L)[, .(cases.prior.release = sum(Numberofcases)), by = .(Product, release.date = edat)]
which gives:
Product release.date cases.prior.release
1: A 2011-01-13 1040
2: A 2011-02-15 14750
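Product B does not appear in this result, whereas the non-equi join in the other answer reports NA for it. That is consistent with nomatch = 0L dropping release windows that have no matching case rows, since the sample data used in this answer contains no rows for product B. A small base R check, with the product columns copied from that sample data (the variable names are illustrative):

```r
# Product column of the release sample data
release_products <- c("A", "A", "B", "B")

# Product column of the casedata sample data: all 19 rows are product A
case_products <- rep("A", 19)

# Products that have a release but no case rows at all
missing_products <- setdiff(release_products, case_products)
print(missing_products)
```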
2) Using the standard join functionality of data.table:
setDT(release)
setDT(casedata)
casedata[, Date := as.Date(Date)
][release[, `:=` (Date = as.Date(Date), idx = .I)
][, .(dates = seq(Date-7,Date,'day')), by = .(Product,idx)],
on = c('Product', Date = 'dates'), nomatch = 0L
][, .(releasedate = Date[.N], cases.prior.release = sum(Numberofcases)), by = .(Product,idx)
][, idx := NULL]
which gives the same result.
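The idea behind the second approach (expand each release into one row per day of its window, join on product and date, then sum per release) can also be sketched in base R. This is an illustrative translation under assumptions, not the answer's code: the intermediate names windows, joined and result are made up here, and the data are copied from the sample data on this page.

```r
# Sample data copied from this page, with dates already converted to class Date
release <- data.frame(
  Date = as.Date(c("2011-01-13", "2011-02-15", "2011-01-14", "2011-02-15")),
  Product = c("A", "A", "B", "B"),
  stringsAsFactors = FALSE
)
casedata <- data.frame(
  Date = as.Date(c("2011-01-13", "2011-01-12", "2011-01-11", "2011-01-10", "2011-01-09",
                   "2011-01-08", "2011-01-07", "2011-01-06", "2011-01-23", "2011-01-31",
                   "2011-02-08", "2011-02-09", "2011-02-10", "2011-02-11", "2011-02-12",
                   "2011-02-13", "2011-02-14", "2011-02-15", "2011-01-31")),
  Product = "A",
  Numberofcases = c(50L, 20L, 100L, 120L, 150L, 180L, 200L, 220L, 500L, 450L,
                    50L, 1000L, 1200L, 1500L, 1800L, 2000L, 2200L, 5000L, 4500L),
  stringsAsFactors = FALSE
)

# Expand every release into one row per day in [release date - 7, release date]
windows <- do.call(rbind, lapply(seq_len(nrow(release)), function(i) {
  data.frame(
    Product = release$Product[i],
    releasedate = release$Date[i],
    Date = seq(release$Date[i] - 7, release$Date[i], by = "day"),
    stringsAsFactors = FALSE
  )
}))

# Inner join: keep only the window days that have case rows for that product
joined <- merge(windows, casedata, by = c("Product", "Date"))

# Sum the cases per product and release date, then order by release date
result <- aggregate(Numberofcases ~ Product + releasedate, data = joined, FUN = sum)
result <- result[order(result$releasedate), ]
print(result)
```

Because merge() performs an inner join by default, releases with no matching case rows (product B in this sample) are dropped, which matches the behaviour of nomatch = 0L.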
Used data:
release <- structure(list(Date = c("2011-01-13", "2011-02-15", "2011-01-14", "2011-02-15"),
Product = c("A", "A", "B", "B")),
.Names = c("Date", "Product"), class = "data.frame", row.names = c(NA, -4L))
casedata <- structure(list(Date = c("2011-01-13", "2011-01-12", "2011-01-11", "2011-01-10", "2011-01-09", "2011-01-08", "2011-01-07", "2011-01-06", "2011-01-23", "2011-01-31", "2011-02-08", "2011-02-09", "2011-02-10", "2011-02-11", "2011-02-12", "2011-02-13", "2011-02-14", "2011-02-15", "2011-01-31"),
Product = c("A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A"),
Numberofcases = c(50L, 20L, 100L, 120L, 150L, 180L, 200L, 220L, 500L, 450L, 50L, 1000L, 1200L, 1500L, 1800L, 2000L, 2200L, 5000L, 4500L)),
.Names = c("Date", "Product", "Numberofcases"), class = "data.frame", row.names = c(NA, -19L))