R: Counting dates within time intervals

Time: 2015-12-28 09:50:45

Tags: r date

Suppose we have the following input data:

df.in <- data.frame(event = c(1,2,3,4,5),
                    start = c("2015-01-01", "2015-01-01", "2015-01-02",
                              "2015-01-02", "2015-01-03"),
                    end = c("2015-01-03", "2015-01-04", "2015-01-03",
                            "2015-01-05", "2015-01-05"))
df.in$start <- as.Date(df.in$start, "%Y-%m-%d")
df.in$end <- as.Date(df.in$end, "%Y-%m-%d")

> df.in
  event      start        end
1     1 2015-01-01 2015-01-03
2     2 2015-01-01 2015-01-04
3     3 2015-01-02 2015-01-03
4     4 2015-01-02 2015-01-05
5     5 2015-01-03 2015-01-05

The goal is to count, for every date, how many events are in progress on it (start date included, end date excluded), and to fill in this data frame:

df.out <- data.frame(date = c("2015-01-01", "2015-01-02", "2015-01-03",
                              "2015-01-04", "2015-01-05"),
                     count = 0)
df.out$date <- as.Date(df.out$date, "%Y-%m-%d")

> df.out
        date count
1 2015-01-01     0
2 2015-01-02     0
3 2015-01-03     0
4 2015-01-04     0
5 2015-01-05     0

Conceptually, the result looks like this (one * per event active on that date):

#1 **
#2 ****
#3 ***
#4 **
#5

So my current idea is a loop:

for (i in seq_along(df.out$date)) {
  # events that have started on or before this date
  temp.df <- df.in[df.in$start <= df.out$date[i], ]
  # minus those that have already ended (the end date itself is excluded)
  df.out$count[i] <- nrow(temp.df) - nrow(temp.df[temp.df$end <= df.out$date[i], ])
}

> df.out
        date count
1 2015-01-01     2
2 2015-01-02     4
3 2015-01-03     3
4 2015-01-04     2
5 2015-01-05     0

It works, but I am a bit afraid that this loop could snowball into something very slow, given that the number of events can easily reach tens or even hundreds of thousands.

So my question is: is there a more efficient way? Maybe by using some date package I could somehow vectorize the whole thing?

1 answer:

Answer 0: (score: 2)

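So I have done my research on data.table::foverlaps(). I will leave my findings here for anyone who might find them useful, because I did not really find these small details when searching through similar posts.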

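foverlaps() compares intervals with intervals, but here only the y argument (df.in in this case) actually holds intervals, so we have to create an artificial interval for the dates, e.g. with df.out$date2 <- df.out$date. Also, there is no simple way (or none that I could find) to specify whether interval endpoints are included or excluded. Since we want to exclude the endpoint in df.in$end, we have to do this manually on the data table itself with a simple df.in$end <- df.in$end - 1.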
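The end - 1 shift works because the dates here have whole-day resolution, so the half-open test start <= d < end selects the same days as the closed test start <= d <= end - 1. A minimal sketch in base R, using a single made-up event (the variable names are illustrative and not from the original post):

```r
# Hypothetical single event covering the half-open interval [start, end)
start <- as.Date("2015-01-01")
end   <- as.Date("2015-01-03")
dates <- seq(as.Date("2015-01-01"), as.Date("2015-01-05"), by = "day")

# Half-open membership: start included, end excluded
half_open <- start <= dates & dates < end

# Closed membership after shifting the end back by one day
closed <- start <= dates & dates <= end - 1

# Both tests pick the same days
stopifnot(identical(half_open, closed))
```

This equivalence would not hold for timestamps with sub-day resolution, where subtracting one day would cut off too much.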

Long story short, here is a working example:

require(data.table)
df.out <- data.table(date = c("2015-01-01", "2015-01-02", "2015-01-03", 
                              "2015-01-04", "2015-01-05"),
                     count = 0)
df.out$date <- as.Date(df.out$date, "%Y-%m-%d")

df.in <- data.table(event = c(1,2,3,4,5), 
                    start = c("2015-01-01", "2015-01-01", "2015-01-02",
                              "2015-01-02", "2015-01-03"),
                    end = c("2015-01-03", "2015-01-04", "2015-01-03",
                            "2015-01-05", "2015-01-05"))
df.in$start <- as.Date(df.in$start, "%Y-%m-%d")
df.in$end <- as.Date(df.in$end, "%Y-%m-%d") - 1

setkey(df.in, start, end)
df.out$date2 <- df.out$date
df.test <- foverlaps(x = df.out, y = df.in, type = "within", by.x = c("date", "date2"), by.y = c("start", "end"))
df.test$count[!is.na(df.test$event)] <- 1
aggregate(count ~ date, data = df.test, sum)

        date count
1 2015-01-01     2
2 2015-01-02     4
3 2015-01-03     3
4 2015-01-04     2
5 2015-01-05     0
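The date grid in the example above is typed out by hand. With many events it could instead be derived from the range of the event dates. A minimal sketch in base R, assuming daily granularity and that every day between the earliest start and the latest end should appear (starts, ends and date_grid are illustrative names):

```r
# Event boundaries from the example data
starts <- as.Date(c("2015-01-01", "2015-01-01", "2015-01-02",
                    "2015-01-02", "2015-01-03"))
ends   <- as.Date(c("2015-01-03", "2015-01-04", "2015-01-03",
                    "2015-01-05", "2015-01-05"))

# One row per calendar day from the earliest start to the latest end
date_grid <- data.frame(date = seq(min(starts), max(ends), by = "day"),
                        count = 0)
```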

Alternatively, you could do it like this:

Data

df.out <- data.table(date = as.Date(c("2015-01-01", "2015-01-02", "2015-01-03", 
                              "2015-01-04", "2015-01-05")))

df.in <- data.table(event = 1:5, 
                    start = as.Date(c("2015-01-01", "2015-01-01", "2015-01-02",
                              "2015-01-02", "2015-01-03")),
                    end = as.Date(c("2015-01-03", "2015-01-04", "2015-01-03",
                            "2015-01-05", "2015-01-05")))

Solution

df.out[, `:=`(start = date, end = date)]
df.in[, end := end - 1L]
setkey(df.out, start, end)
foverlaps(df.in, df.out)[, .(count = .N), by = date]
#          date count
# 1: 2015-01-01     2
# 2: 2015-01-02     4
# 3: 2015-01-03     3
# 4: 2015-01-04     2

Or, if you want to update df.out itself (which also keeps the dates that have no events), you can also do:

res <- foverlaps(df.in, df.out, which = TRUE)[, .N, by = yid]
df.out[res$yid, Count := res$N]
df.out[is.na(Count), Count := 0L]
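As a cross-check that does not depend on data.table, the counting rule from the question (events started on or before a date, minus events already ended) can also be vectorized in base R with findInterval(). With its default arguments, findInterval(x, vec) returns for each x the number of elements of the sorted vector vec that are <= x. This is only a sketch under the same start-inclusive, end-exclusive convention, not part of the original answer. The variable names are illustrative.

```r
# Event boundaries and query dates from the example data
starts <- as.Date(c("2015-01-01", "2015-01-01", "2015-01-02",
                    "2015-01-02", "2015-01-03"))
ends   <- as.Date(c("2015-01-03", "2015-01-04", "2015-01-03",
                    "2015-01-05", "2015-01-05"))
dates  <- seq(as.Date("2015-01-01"), as.Date("2015-01-05"), by = "day")

# Number of events with start <= date
started <- findInterval(as.numeric(dates), as.numeric(sort(starts)))

# Number of events with end <= date (these are no longer active)
ended <- findInterval(as.numeric(dates), as.numeric(sort(ends)))

# Active events per date
counts <- data.frame(date = dates, count = started - ended)
```

On the example data this gives the same counts as the loop in the question (2, 4, 3, 2, 0), including the zero for the last date.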