Summarising monthly series with data.table (counting specific events)

Asked: 2016-08-11 07:31:47

Tags: r data.table time-series

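I hope this is an acceptable R / data.table question.

I have a table with 3 columns:

  • id: geographic location ID (303,453 locations)
  • month: the month, spanning 25 years (1990-2014)
  • spei: a climate index that varies between -7 and 7.

I need to count the occurrences of drought at each location over the whole 1990-2014 period. A drought event is defined as "a period in which the SPEI is continuously negative and reaches a value of -1.0 or less. The drought begins when the SPEI first falls below zero and ends with the first positive SPEI value following a value of -1.0 or less."

I know this should be doable with shift() and rolling joins, but some help would be very welcome!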

# Sample table structure
dt <- data.table(
  id = rep(1:303453, each=25*12),
  month = rep(seq(as.Date("1990-01-01"), as.Date("2014-12-31"), "month"), 303453),
  spei = runif(303453*25*12, -7, 7))

# A minimal example with 1 location over 12 months
library(data.table)
library(xts)

dt <- data.table(
  id = rep("loc1", each=12),
  month = seq(as.Date("2014-01-01"), as.Date("2014-12-31"), "month"),
  spei = c(-2, -1.1, -0.5, 1.2, -1.2, 2.3, -1.7, -2.1, 0.9, 1.2, -0.9, -0.2))

spei.ts <- xts(dt$spei, order.by=dt$month, frequency="month")
plot(spei.ts, type="bars")

[Figure: bar plot of the monthly SPEI series for loc1, 2014]

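This shows 3 drought events over the 1-year period. That is what I need to identify and count.

I hope some of you are more used to working with time series. Many thanks, - Mel.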

2 answers:

Answer 0: (score: 2)

Here is a starting point for getting the desired result. Experts can probably suggest ways to make it faster.

Edit: removing [inline code missing in the source] improved the speed by about 8x.

library(data.table)
set.seed(42)
n <- 300 # 303453 will be ~1000 times slower
dt <- data.table(
  id = rep(1:n, each=25*12),
  month = rep(seq(as.Date("1990-01-01"), as.Date("2014-12-31"), "month"), n),
  spei = runif(n*25*12, -7, 7))

system.time({
  # neg: month is below zero; neg1: month reaches the -1 threshold
  dt[, `:=`(neg = (spei < 0), neg1 = (spei <= -1))]
  # label each run of negative months; positive months get NA
  dt[, runid := ifelse(neg, rleid(neg), NA)]
  # one row per negative run; length is NA unless the run reaches -1
  res <- dt[!is.na(runid),
            .(length = .N[any(neg1)], start = min(month), end = max(month)),
            by = .(id, runid)][!is.na(length)]
})
#    user  system elapsed
#   0.345   0.000   0.344

# counts of droughts per id:
res[, .(nDroughts = .N), by = id]

# list of droughts per id: (NB: don't include 1st positive value after)
res[, .(droughtN = seq_len(.N), start, end), by = id]
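To see what the run-length idea does on a small input, here is a minimal sketch applied to the 12-month example from the question. It is a variant of the answer's code rather than the code itself: it calls rleid() per id and uses an explicit logical column (severe, a name chosen here for illustration) to mark runs that reach -1 or lower.

```r
library(data.table)

# The 12-month, single-location example from the question
dt <- data.table(
  id = rep("loc1", 12),
  month = seq(as.Date("2014-01-01"), as.Date("2014-12-31"), "month"),
  spei = c(-2, -1.1, -0.5, 1.2, -1.2, 2.3, -1.7, -2.1, 0.9, 1.2, -0.9, -0.2))

# Give every run of consecutive same-sign months its own label within each id
dt[, runid := rleid(spei < 0), by = id]

# One row per negative run; 'severe' is TRUE when the run reaches -1 or lower
runs <- dt[spei < 0,
           .(length = .N, start = min(month), end = max(month),
             severe = any(spei <= -1)),
           by = .(id, runid)]

# Only the severe runs count as droughts
droughts <- runs[severe == TRUE]
droughts[, .(nDroughts = .N), by = id]
```

On this input the November-December run is negative but never reaches -1, so it is dropped and three droughts remain, matching the question.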

Answer 1: (score: 2)

Updated based on the comments...

If all that is needed is the count:

# Let 'sp' = starting point of potential drought
# Let 'dv' = drought level validation
# The cumsum just gives unique ids to group by.
dt[, sp := (spei <= 0) & (shift(spei, fill = 1) > 0), by = id]
dt[, dv := min(spei) <= -1, by = .(id, cumsum(sp))]
dt[sp & dv, .N, by = id]
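It may help to look at the intermediate columns on the question's 12-month example. The sketch below follows the same steps but stores the cumulative sum in an explicit column (grp, a name introduced here for illustration) so that the grouping is visible.

```r
library(data.table)

# The 12-month, single-location example from the question
dt <- data.table(
  id = rep("loc1", 12),
  month = seq(as.Date("2014-01-01"), as.Date("2014-12-31"), "month"),
  spei = c(-2, -1.1, -0.5, 1.2, -1.2, 2.3, -1.7, -2.1, 0.9, 1.2, -0.9, -0.2))

# sp: TRUE on the first non-positive month after a positive one
# (fill = 1 treats the month before the series as positive)
dt[, sp := (spei <= 0) & (shift(spei, fill = 1) > 0), by = id]

# grp: increases by one at every potential drought start
dt[, grp := cumsum(sp), by = id]

# dv: TRUE for every month of a group whose minimum reaches -1 or lower
dt[, dv := min(spei) <= -1, by = .(id, grp)]

# Inspect the intermediate columns, then count validated starts
print(dt[, .(month, spei, sp, grp, dv)])
counts <- dt[sp & dv, .N, by = id]
counts
```

Each group runs from one potential start up to the month before the next one, so it contains exactly one non-positive spell plus the positive months that follow it. Those positive months cannot change whether the minimum reaches -1.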

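However, as mentioned in the comments, you have already got this far, so you have seen how shift can be used. Since you like the idea of identifying the dates, why not use shift there as well?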

# Extending the previous columns...
dt[, ep := (shift(spei, type = "lead", fill = 1) > 0) & (spei <= 0), by = id]
cbind(dt[sp & dv, .(start = month), by = id],
      dt[ep & dv, .(end = month), by = id][,id := NULL])
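The cbind() above pairs starts and ends by row position, which works because both subsets come out in the same order. As a possible alternative, the sketch below derives start and end in a single grouped step so the pairing is explicit. The column names grp and lowest are introduced here for illustration and are not part of the answer.

```r
library(data.table)

# The 12-month, single-location example from the question
dt <- data.table(
  id = rep("loc1", 12),
  month = seq(as.Date("2014-01-01"), as.Date("2014-12-31"), "month"),
  spei = c(-2, -1.1, -0.5, 1.2, -1.2, 2.3, -1.7, -2.1, 0.9, 1.2, -0.9, -0.2))

# Same starting-point flag as in the answer, turned into a group counter
dt[, sp := (spei <= 0) & (shift(spei, fill = 1) > 0), by = id]
dt[, grp := cumsum(sp), by = id]

# Summarise the non-positive months of each group, then keep the groups
# whose lowest value reaches -1 or lower
spells <- dt[spei <= 0,
             .(start = min(month), end = max(month), lowest = min(spei)),
             by = .(id, grp)][lowest <= -1]
spells
```

Here end is the last non-positive month of each drought, not the first positive month after it.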

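If you want the dates as shown by the red lines in the figure, just add one month, unless it is already the last month. We can also get the length...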

# Extending the previous columns again...
dt[, end.month := shift(month, type = "lead", fill = month[.N]), by = id]
dt[, orig.id := .I]
starts <- dt[sp & dv][, did := .I]
ends <- dt[ep & dv][, did := .I]
starts[ends, on = "did"][
  ,.(id = id, length = 1 + i.orig.id - orig.id, start = month, end = i.end.month)]

which yields

     id length      start        end
1: loc1      3 2014-01-01 2014-04-01
2: loc1      1 2014-05-01 2014-06-01
3: loc1      2 2014-07-01 2014-09-01

Timings using n=300

> microbenchmark(max = max.full(copy(dt))[, .(nDroughts = .N), by = id],
+                thellcounts = thell.counts(copy(dt)),
+                thell .... [TRUNCATED] 
Unit: milliseconds
        expr       min        lq      mean    median        uq        max neval
         max 218.19152 220.30895 342.18605 222.75507 250.36644 1350.15847    10
 thellcounts  20.36785  22.27349  28.45167  23.39313  24.38610   78.25046    10
  thelldates  28.24378  28.64849  30.59897  30.57793  31.25352   34.51569    10
 thelldates2  36.19724  39.79588  42.34457  41.52455  42.41872   57.28073    10

Using n=3000

> microbenchmark(max = max.full(copy(dt))[, .(nDroughts = .N), by = id],
+                thellcounts = thell.counts(copy(dt)),
+                thell .... [TRUNCATED] 
Unit: milliseconds
        expr       min        lq      mean    median        uq       max neval
         max 2126.1138 2148.3453 2207.7801 2205.3536 2241.2848 2340.1203    10
 thellcounts  197.7312  202.4817  234.2949  205.4828  304.1556  309.1028    10
  thelldates  261.9889  264.5597  283.9970  266.1244  267.8603  374.6406    10
 thelldates2  320.6352  331.7558  374.4110  340.2668  439.1490  441.8473    10
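The benchmark calls above are truncated and the wrapper functions they use are not shown, so the following is only a rough sketch of how a comparable timing could be set up. The names count_droughts_shift and make_data are hypothetical, and base system.time() stands in for microbenchmark to keep the example dependency-free.

```r
library(data.table)

# Hypothetical wrapper around the counting steps from this answer
count_droughts_shift <- function(d) {
  d[, sp := (spei <= 0) & (shift(spei, fill = 1) > 0), by = id]
  d[, dv := min(spei) <= -1, by = .(id, cumsum(sp))]
  d[sp & dv, .N, by = id]
}

# Hypothetical helper: simulated data in the shape used in the question
make_data <- function(n) {
  data.table(
    id = rep(seq_len(n), each = 25 * 12),
    month = rep(seq(as.Date("1990-01-01"), as.Date("2014-12-31"), "month"), n),
    spei = runif(n * 25 * 12, -7, 7))
}

set.seed(42)
for (n in c(30, 300)) {
  d <- make_data(n)
  # copy() keeps the timed call from modifying 'd' by reference
  elapsed <- system.time(count_droughts_shift(copy(d)))[["elapsed"]]
  cat("n =", n, "elapsed seconds:", elapsed, "\n")
}
```

Passing copy(d) matters because := adds columns by reference. Without it, repeated runs would operate on a table that already carries the sp and dv columns.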