Sum values in df1 based on date ranges in df2

Time: 2018-07-11 16:43:41

Tags: r

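I am trying to return the sum of the values in one data frame that fall between two dates given in another data frame. The answers available on Stack don't seem to work for my application. I tried using data.table, but to no avail, so here goes.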

Create the date ranges

MeanRemaining <- seq(as.Date("2017-01-01"),as.Date("2017-02-28"),2)
MeanRemaining<-as.data.frame(cbind(MeanRemaining,lag(MeanRemaining)))
colnames(MeanRemaining)<-c("InspDate", "PrevInspDate")
MeanRemaining$InspDate<-as.Date(MeanRemaining$InspDate, origin = "1970/01/01")
MeanRemaining$PrevInspDate<-as.Date(MeanRemaining$PrevInspDate, origin = "1970/01/01")

Importantly, the date ranges are not actually fixed as they are above; they could be any ranges roughly a week apart.

Create the values to be summed

# One row per day, with a random volume for each day
DailyTonnes <- data.frame(
  date = seq(as.Date("2016-12-01"), as.Date("2017-03-28"), 1),
  Vol = sample(abs(rnorm(118)) * 1000, replace = TRUE)
)

Goal

I want to sum "Vol" in "DailyTonnes" over each date range in "MeanRemaining" and return the total "Vol" to the corresponding row of "MeanRemaining".

With the help of a similar question, I tried

library(data.table)
setDT(MeanRemaining)
setDT(DailyTonnes)

MeanRemaining[DailyTonnes[MeanRemaining, sum(Vol), on = .(date >= InspDate, date <= PrevInspDate),
            by = .EACHI], TotalVol := V1, on = .(InspDate=date)]

But this returns NA values.

Any suggestions would be much appreciated.

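1 Answer:

Answer 0: (score: 1)

I believe your question contains everything needed for an answer.

I tidied up your code a little and changed the last line (the only incorrect code). The join in the last line was overcomplicated, and I don't think it brings any memory or performance gain.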

library(data.table)
# Create MeanRemaining
MeanRemaining <-
  data.table(InspDate = seq(as.Date("2017-01-01"), as.Date("2017-02-28"), 2))
# I replaced lag with shift, I think it is clearer this way
MeanRemaining[, PrevInspDate := shift(InspDate, type = "lead", fill = 1000000L)]

# set seed for reproducibility
set.seed(13)
# Create DailyTonnes, I changed the end date to generate empty intervals
DailyTonnes <- data.table(date = seq(as.Date("2016-12-01"),
                                     as.Date("2017-01-28"), 1))
# one Vol value per date (the shortened range has 59 days, not 118)
DailyTonnes[, Vol := sample(abs(rnorm(.N)) * 1000, replace = TRUE)]

# I changed the <= condition to <, I think it fits PrevInspDate better
# This should be your final result if I'm not wrong
SingleCase <-
  DailyTonnes[MeanRemaining, sum(Vol), on = .(date >= InspDate, date < PrevInspDate), by = .EACHI]

# SingleCase has two variables called date, this may be a small bug in data.table
print(names(SingleCase))

# change the names of the data.table to suit your needs
names(SingleCase) <- c("InspDate", "PrevInspDate", "TotalVol")
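
Because the end date was shortened to create empty intervals, it may help to see what the join returns for a range with no matching days. The following is a minimal sketch on made-up toy data (the tables `daily` and `ranges` and their columns `from` and `to` are hypothetical, not part of the code above): by default an unmatched range yields NA, while `sum(Vol, na.rm = TRUE)` yields 0. A base R loop is included only as a cross-check.

```r
library(data.table)

# Toy data (hypothetical): daily volumes for 1-4 January only
daily <- data.table(date = as.Date("2017-01-01") + 0:3,
                    Vol  = c(10, 20, 30, 40))

# Three half-open intervals [from, to); the last one matches no days
ranges <- data.table(from = as.Date(c("2017-01-01", "2017-01-03", "2017-01-05")),
                     to   = as.Date(c("2017-01-03", "2017-01-05", "2017-01-07")))

# Default behaviour: the unmatched interval gets Vol = NA, so its sum is NA
with_na <- daily[ranges, .(TotalVol = sum(Vol)),
                 on = .(date >= from, date < to), by = .EACHI]

# With na.rm = TRUE the unmatched interval sums to 0 instead
with_zero <- daily[ranges, .(TotalVol = sum(Vol, na.rm = TRUE)),
                   on = .(date >= from, date < to), by = .EACHI]

# Base R cross-check of the same interval sums
check <- sapply(seq_len(nrow(ranges)), function(k) {
  sum(daily$Vol[daily$date >= ranges$from[k] & daily$date < ranges$to[k]])
})
```

Whether NA or 0 is the right value for an empty interval depends on your application, so treat the `na.rm = TRUE` variant as one option rather than the required fix.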

Edit: recovering multiple variables from the MeanRemaining table

Retrieving multiple variables from MeanRemaining is quite tricky. With a small number of variables it is easy to solve:

# Add variables to MeanRemaining
for (i in 1:100) {
  MeanRemaining[, paste0("extracol", i) := sample(.N)]
}

# Two variable case
smallmultiple <-
  DailyTonnes[MeanRemaining, list(TotalVol = sum(Vol),
                                  extracol1 = i.extracol1 ,
                                  extracol2 = i.extracol2), on = .(date >= InspDate, date < PrevInspDate), by = .EACHI]

# Correct date names
names(smallmultiple)[1:2] <- c("InspDate", "PrevInspDate")

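It gets hard when many variables are involved. There is this feature request in github that could solve your problem, but it is not currently available. This question faces a similar problem, but it can't be applied to your case.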

A way to handle a large number of variables is:

# obtain names of variables to be kept in the later join
joinkeepcols <-
  setdiff(names(MeanRemaining),  c("InspDate", "PrevInspDate"))

# the "i" indicates the table to take the variables from
joinkeepcols2 <- paste0("i.", joinkeepcols)

# Prepare an expression for the data.table environment
keepcols <-
  paste(paste(joinkeepcols, joinkeepcols2, sep = " = "), collapse = ", ")

# Complete expression to be evaluated in data.table
evalexpression <- paste0("list(
                         TotalVol = sum(Vol),",
                         keepcols, ")")

# The magic comes here (you can assign it to MeanRemaining)
bigmultiple <-
  DailyTonnes[MeanRemaining, eval(parse(text = evalexpression)), on = .(date >= InspDate, date < PrevInspDate), by = .EACHI]

# Correct date names
names(bigmultiple)[1:2] <- c("InspDate", "PrevInspDate")
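
As a possible alternative to building the expression as text, here is a minimal sketch that assigns the sums back to the range table by reference, so every existing column is kept without being listed. It relies on `by = .EACHI` returning one row per row of the range table, in the same order. The toy tables `daily` and `ranges` and the columns `from`, `to`, `extra1` and `extra2` are hypothetical stand-ins for DailyTonnes, MeanRemaining and its extra columns.

```r
library(data.table)

# Toy data (hypothetical): daily volumes for 1-4 January only
daily <- data.table(date = as.Date("2017-01-01") + 0:3,
                    Vol  = c(10, 20, 30, 40))

# Range table with extra columns that should survive untouched
ranges <- data.table(from   = as.Date(c("2017-01-01", "2017-01-03", "2017-01-05")),
                     to     = as.Date(c("2017-01-03", "2017-01-05", "2017-01-07")),
                     extra1 = c("a", "b", "c"),
                     extra2 = 3:1)

# One sum per row of `ranges`, in the order of `ranges`;
# the unnamed result column is called V1
totals <- daily[ranges, sum(Vol, na.rm = TRUE),
                on = .(date >= from, date < to), by = .EACHI]

# Add the sums as a new column by reference; all other columns are kept
ranges[, TotalVol := totals$V1]
```

This avoids `eval(parse())` and the renaming of the duplicated date columns, at the cost of a second step after the join.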