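I hope this is an acceptable R / data.table question.
I have a 3-column table:
id: geographic location ID (303,453 locations)
month: month, over the 25 years 1990-2014
spei: a climate index that varies between -7 and 7
I need to count the occurrences of drought at each location over the whole 1990-2014 period. A drought event is defined as "a period of time in which the SPEI is continuously negative and the SPEI reaches a value of -1.0 or less. The drought starts when the SPEI first falls below zero and ends with the first positive SPEI value following a value of -1.0 or less".
I know this should be doable using shift() and rolling joins, but some help would be very welcome!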
# Sample table structure
dt <- data.table(
id = rep(1:303453, each=25*12),
month = rep(seq(as.Date("1990-01-01"), as.Date("2014-12-31"), "month"), 303453),
spei = runif(303453*25*12, -7, 7))
# A minimal example with 1 location over 12 months
library(data.table)
library(xts)
dt <- data.table(
id = rep("loc1", each=12),
month = seq(as.Date("2014-01-01"), as.Date("2014-12-31"), "month"),
spei = c(-2, -1.1, -0.5, 1.2, -1.2, 2.3, -1.7, -2.1, 0.9, 1.2, -0.9, -0.2))
spei.ts <- xts(dt$spei, order.by=dt$month, frequency="month")
plot(spei.ts, type="bars")
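This shows 3 drought events over the 1-year period. That is what I need to identify and count.
Hoping some of you are more used to working with time series. Many thanks, Mel.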
Answer 0 (score: 2)
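Here is a starting point for getting the desired result. Experts can probably suggest ways to speed it up.
Edit: revised by removing a step from the earlier version.
library(data.table)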
set.seed(42)
n <- 300 # 303453 will be ~1000 times slower
dt <- data.table(
id = rep(1:n, each=25*12),
month = rep(seq(as.Date("1990-01-01"), as.Date("2014-12-31"), "month"), n),
spei = runif(n*25*12, -7, 7))
system.time({
dt[, `:=`(neg = (spei < 0), neg1 = (spei <= -1))]
dt[, runid := ifelse(neg, rleid(neg), NA)]
res <- dt[!is.na(runid),
.(length = .N[any(neg1)], start = min(month), end = max(month)),
by = .(id, runid)][!is.na(length)]
})
# user system elapsed
# 0.345 0.000 0.344
# counts of droughts per id:
res[, .(nDroughts = .N), by = id]
# list of droughts per id: (NB: don't include 1st positive value after)
res[, .(droughtN = seq_len(.N), start, end), by = id]
That removal made the code about 8 times faster.
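To see what the run-length idea does, it can help to trace it on the 12-month example from the question. The sketch below is a variation on the code above, not the same code: it filters runs with an explicit `reaches` column instead of the `.N[any(neg1)]` trick, and the names `ex`, `runs`, `reaches` and `droughts` are made up for illustration.

```r
library(data.table)

# The 12-month, single-location example from the question.
ex <- data.table(
  id    = "loc1",
  month = seq(as.Date("2014-01-01"), as.Date("2014-12-01"), by = "month"),
  spei  = c(-2, -1.1, -0.5, 1.2, -1.2, 2.3, -1.7, -2.1, 0.9, 1.2, -0.9, -0.2))

# Flag negative months, then number consecutive runs of the same sign.
# rleid() increments every time the value of `neg` changes.
ex[, neg := spei < 0]
ex[, runid := rleid(neg), by = id]

# Summarise each negative run; `reaches` (hypothetical name) records
# whether the run ever gets to -1 or lower.
runs <- ex[neg == TRUE,
           .(length = .N, start = min(month), end = max(month),
             reaches = any(spei <= -1)),
           by = .(id, runid)]

# Keep only the runs that qualify as droughts.
droughts <- runs[reaches == TRUE]
print(droughts)
```

The last negative run (-0.9, -0.2) never reaches -1, so it is dropped, which leaves the 3 events described in the question.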
Answer 1 (score: 2)
Updated based on the comments...
If all you need is the counts:
# Let 'sp' = starting point of potential drought
# Let 'dv' = drought level validation
# The cumsum just gives unique ids to group by.
dt[, sp := (spei <= 0) & (shift(spei, fill = 1) > 0), by = id]
dt[, dv := min(spei) <= -1, by = .(id, cumsum(sp))]
dt[sp & dv, .N, by = id]
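For anyone wondering what the cumsum grouping looks like, here is a small trace on the 12-month example from the question. It is only a sketch: the `grp` column is added purely for illustration (the code above computes it inline inside `by`), and it is computed per id here rather than over the whole table, which should give the same groups once paired with `id`.

```r
library(data.table)

# The 12-month, single-location example from the question.
ex <- data.table(
  id    = "loc1",
  month = seq(as.Date("2014-01-01"), as.Date("2014-12-01"), by = "month"),
  spei  = c(-2, -1.1, -0.5, 1.2, -1.2, 2.3, -1.7, -2.1, 0.9, 1.2, -0.9, -0.2))

# sp: this month is non-positive and the previous month was positive.
# fill = 1 treats the month before the series as positive.
ex[, sp := (spei <= 0) & (shift(spei, fill = 1) > 0), by = id]

# grp (illustrative column): running count of starting points, so each
# candidate spell and the positive months after it share one number.
ex[, grp := cumsum(sp), by = id]

# dv: TRUE for every row of a group whose minimum reaches -1 or lower.
ex[, dv := min(spei) <= -1, by = .(id, grp)]

# One row per validated starting point, counted per location.
counts <- ex[sp & dv, .N, by = id]
print(ex[, .(month, spei, sp, grp, dv)])
print(counts)
```

Group 4 (-0.9, -0.2) has a minimum above -1, so its starting point is not counted.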
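However, as you said in the comments, you have already been there, so you have seen how shift can be used. Since you like the idea of identifying the dates, why not use shift there as well?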
# Extending the previous columns...
dt[, ep := (shift(spei, type = "lead", fill = 1) > 0) & (spei <= 0), by = id]
cbind(dt[sp & dv, .(start = month), by = id],
dt[ep & dv, .(end = month), by = id][,id := NULL])
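As a quick check of the start/end pairing, here is the same idea traced on the 12-month example from the question. This is a sketch under the assumption that every validated spell has exactly one `sp` row and one `ep` row, so the two filtered vectors line up in order; the `spells` table is a made-up name.

```r
library(data.table)

# The 12-month, single-location example from the question.
ex <- data.table(
  id    = "loc1",
  month = seq(as.Date("2014-01-01"), as.Date("2014-12-01"), by = "month"),
  spei  = c(-2, -1.1, -0.5, 1.2, -1.2, 2.3, -1.7, -2.1, 0.9, 1.2, -0.9, -0.2))

# Starting points and validation, as in the counting code.
ex[, sp := (spei <= 0) & (shift(spei, fill = 1) > 0), by = id]
ex[, dv := min(spei) <= -1, by = .(id, cumsum(sp))]

# ep: this month is non-positive and the next month is positive.
# fill = 1 treats the month after the series as positive.
ex[, ep := (shift(spei, type = "lead", fill = 1) > 0) & (spei <= 0), by = id]

# Pair the validated starts with the validated ends, in row order.
spells <- data.table(
  id    = ex[sp & dv, id],
  start = ex[sp & dv, month],
  end   = ex[ep & dv, month])
print(spells)
```

Here `end` is the last non-positive month of each spell, not the first positive month after it.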
If you want the dates to match the red lines in the plot, just add one month, unless it is the last month. We can also get the lengths...
# Extending the previous columns again...
dt[, end.month := shift(month, type = "lead", fill = month[.N]), by = id]
dt[, orig.id := .I]
starts <- dt[sp & dv][, did := .I]
ends <- dt[ep & dv][, did := .I]
starts[ends, on = "did"][
,.(id = id, length = 1 + i.orig.id - orig.id, start = month, end = i.end.month)]
which produces
     id length      start        end
1: loc1      3 2014-01-01 2014-04-01
2: loc1      1 2014-05-01 2014-06-01
3: loc1      2 2014-07-01 2014-09-01
It's still fast! Using n=300:
> microbenchmark(max = max.full(copy(dt))[, .(nDroughts = .N), by = id],
+ thellcounts = thell.counts(copy(dt)),
+ thell .... [TRUNCATED]
Unit: milliseconds
        expr       min        lq      mean    median        uq        max neval
         max 218.19152 220.30895 342.18605 222.75507 250.36644 1350.15847    10
 thellcounts  20.36785  22.27349  28.45167  23.39313  24.38610   78.25046    10
  thelldates  28.24378  28.64849  30.59897  30.57793  31.25352   34.51569    10
 thelldates2  36.19724  39.79588  42.34457  41.52455  42.41872   57.28073    10
Using n=3000:
> microbenchmark(max = max.full(copy(dt))[, .(nDroughts = .N), by = id],
+ thellcounts = thell.counts(copy(dt)),
+ thell .... [TRUNCATED]
Unit: milliseconds
        expr       min        lq      mean    median        uq       max neval
         max 2126.1138 2148.3453 2207.7801 2205.3536 2241.2848 2340.1203    10
 thellcounts  197.7312  202.4817  234.2949  205.4828  304.1556  309.1028    10
  thelldates  261.9889  264.5597  283.9970  266.1244  267.8603  374.6406    10
 thelldates2  320.6352  331.7558  374.4110  340.2668  439.1490  441.8473    10
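The benchmark calls pass `copy(dt)` to each function because `:=` adds columns by reference. A rough sketch of how one such wrapper could look is below. The function name `count_droughts` and its body are assumptions for illustration, not the functions that were timed, and base `system.time()` stands in for microbenchmark.

```r
library(data.table)

# Hypothetical wrapper around the counting approach; the name and body
# are assumptions, not the functions benchmarked above.
count_droughts <- function(d) {
  d[, sp := (spei <= 0) & (shift(spei, fill = 1) > 0), by = id]
  d[, dv := min(spei) <= -1, by = .(id, cumsum(sp))]
  d[sp & dv, .N, by = id]
}

# Simulated data with the same shape as in the question, for n locations.
set.seed(42)
n <- 300
dt <- data.table(
  id    = rep(1:n, each = 25 * 12),
  month = rep(seq(as.Date("1990-01-01"), as.Date("2014-12-01"), by = "month"), n),
  spei  = runif(n * 25 * 12, -7, 7))

# copy() keeps the helper columns added by := out of the original table.
timing <- system.time(res <- count_droughts(copy(dt)))
print(timing[["elapsed"]])
print(head(res))
```

Without `copy()`, the `sp` and `dv` columns would be left behind in `dt` after the first run, so later runs would not start from the same table.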