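I'm new to R and not really sure how to filter data in a data frame.
I created a data frame with two columns: monthly dates and the corresponding temperatures. It has 324 rows.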
> head(Nino3.4_1974_2000)
Month_common Nino3.4_degree_1974_2000_plain
1 1974-01-15 -1.93025
2 1974-02-15 -1.73535
3 1974-03-15 -1.20040
4 1974-04-15 -1.00390
5 1974-05-15 -0.62550
6 1974-06-15 -0.36915
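The filter rule is to select temperatures greater than or equal to 0.5 degrees. In addition, the condition must hold for at least 5 consecutive months.
I have already removed the rows with temperatures below 0.5 degrees (see below).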
for (i in 1) {
el_nino=Nino3.4_1974_2000[which(Nino3.4_1974_2000$Nino3.4_degree_1974_2000_plain >= 0.5),]
}
> head(el_nino)
Month_common Nino3.4_degree_1974_2000_plain
32 1976-08-15 0.5192000
33 1976-09-15 0.8740000
34 1976-10-15 0.8864501
35 1976-11-15 0.8229501
36 1976-12-15 0.7336500
37 1977-01-15 0.9276500
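However, I still need to extract the runs of at least 5 consecutive months. I hope someone can help me out.
Answer 0 (score: 4)
If you can always rely on the interval being one month, then let's temporarily discard the time information: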
temps <- Nino3.4_1974_2000$Nino3.4_degree_1974_2000_plain
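Since each temperature in that vector is always one month apart, we just need to look for runs where temps[i] >= 0.5, and each run must be at least 5 long.
If we do the following: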
ofinterest <- temps >= 0.5
we will have a vector ofinterest with values TRUE FALSE FALSE TRUE TRUE ...., which is TRUE where temps[i] >= 0.5 and FALSE otherwise.
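To rephrase your problem, we just need to look for occurrences of at least five TRUEs in a row.
To do this we can use the function rle. Its help page says:
> ?rle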
Description
Compute the lengths and values of runs of equal values in a vector
- or the reverse operation.
Value:
‘rle()’ returns an object of class ‘"rle"’ which is a list with
components:
lengths: an integer vector containing the length of each run.
values: a vector of the same length as ‘lengths’ with the
corresponding values.
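For a concrete feel of what rle returns, here is a small sketch on a made-up logical vector (the vector x is purely illustrative and does not come from the question's data):

```r
# A made-up logical vector: 2 TRUEs, 1 FALSE, 5 TRUEs, 2 FALSEs
x <- c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE)
r <- rle(x)
r$lengths   # 2 1 5 2
r$values    # TRUE FALSE TRUE FALSE
# inverse.rle() reverses the operation and rebuilds the original vector
identical(inverse.rle(r), x)   # TRUE
```

Only the third run here is both TRUE and at least 5 long, which is exactly the kind of run the question is after.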
So we use rle to count up all the streaks of consecutive TRUEs in a row and consecutive FALSEs in a row, and look for streaks of at least 5 TRUEs in a row.
I'll just make up some data to demonstrate:
# for you, temps <- Nino3.4_1974_2000$Nino3.4_degree_1974_2000_plain
temps <- runif(1000)
# make a vector that is TRUE when temperature is >= 0.5 and FALSE otherwise
ofinterest <- temps >= 0.5
# count up the runs of TRUEs and FALSEs using rle:
runs <- rle(ofinterest)
# we need to find points where runs$lengths >= 5 (i.e. at least 5 in a row),
# AND runs$values is TRUE (so at least 5 'TRUE's in a row).
streakIs <- which(runs$lengths>=5 & runs$values)
# these are all the el_nino occurrences.
# We need to convert `streakIs` into indices into our original `temps` vector.
# To do this we add up all the `runs$lengths` before `streakIs[i]` and add 1;
# that gives the index into `temps` where the streak starts.
# that is:
# startMonths <- c()
# for ( n in streakIs ) {
#   startMonths <- c(startMonths, sum(runs$lengths[seq_len(n-1)]) + 1)
# }
#
# However, since this is R we can vectorise with sapply.
# Note seq_len(n-1) rather than 1:(n-1): when the streak is the very first run
# (n = 1), 1:0 would count backwards and give the wrong start index.
startMonths <- sapply(streakIs, function(n) sum(runs$lengths[seq_len(n-1)])+1)
Now, if you do Nino3.4_1974_2000$Month_common[startMonths], you will get all the months in which El Nino started.
It boils down to just a few lines:
runs <- rle(Nino3.4_1974_2000$Nino3.4_degree_1974_2000_plain>=0.5)
streakIs <- which(runs$lengths>=5 & runs$values)
startMonths <- sapply(streakIs, function(n) sum(runs$lengths[seq_len(n-1)])+1)
Nino3.4_1974_2000$Month_common[startMonths]
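If you also want the month in which each streak ends, the same idea can be wrapped in a small helper. This is only a sketch: find_streaks, its arguments, and the demo data frame are hypothetical names made up for illustration, and it assumes the temperature column has no missing values.

```r
# Hypothetical helper (not from the original answer): returns the first and
# last row index of every run of at least `min_len` values >= `threshold`.
# Assumes `temps` contains no NA values.
find_streaks <- function(temps, threshold = 0.5, min_len = 5) {
  runs <- rle(temps >= threshold)
  keep <- runs$lengths >= min_len & runs$values
  last <- cumsum(runs$lengths)[keep]        # row where each kept run ends
  first <- last - runs$lengths[keep] + 1L   # row where each kept run starts
  data.frame(first = first, last = last)
}

# Made-up monthly data standing in for Nino3.4_1974_2000
demo <- data.frame(
  Month = seq(as.Date("1974-01-15"), by = "month", length.out = 12),
  Temp  = c(0.6, 0.7, 0.2, 0.9, 0.8, 0.7, 0.6, 0.5, 0.1, 0.6, 0.6, 0.3)
)
streaks <- find_streaks(demo$Temp)
# Attach the start and end dates of each streak
streaks$start <- demo$Month[streaks$first]
streaks$end   <- demo$Month[streaks$last]
streaks
```

With this made-up data the only qualifying streak covers rows 4 to 8, i.e. 1974-04-15 to 1974-08-15. For the real data you would call find_streaks(Nino3.4_1974_2000$Nino3.4_degree_1974_2000_plain) and index Month_common in the same way.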
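Answer 1 (score: 1)
Here is one way that uses the fact that the months are regular and always one month apart. The problem then reduces to finding runs of at least 5 consecutive rows where the temperature is >= 0.5 degrees: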
# Some sample data
d <- data.frame(Month=1:20, Temp=c(rep(1,6),0,rep(1,4),0,rep(1,5),0, rep(1,2)))
d
# Use rle to find runs of temps >= 0.5 degrees
x <- rle(d$Temp >= 0.5)
# Then find the last row in each run of 5 or more
y <- x$lengths>=5 # BUG HERE: See update below!
lastRow <- cumsum(x$lengths)[y]
# Finally, deduce the first row and make a result matrix
firstRow <- lastRow - x$lengths[y] + 1L
res <- cbind(firstRow, lastRow)
res
# firstRow lastRow
#[1,] 1 6
#[2,] 13 17
Update: I had a bug that also detected runs of 5 or more values below 0.5. Here is the updated code (and test data):
d <- data.frame(Month=1:20, Temp=c(rep(0,6),1,0,rep(1,4),0,rep(1,5),0, 1))
x <- rle(d$Temp >= 0.5)
y <- x$lengths>=5 & x$values
lastRow <- cumsum(x$lengths)[y]
firstRow <- lastRow - x$lengths[y] + 1L
res <- cbind(firstRow, lastRow)
res
# firstRow lastRow
#[1,] 14 18
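To pull out the actual rows rather than just the first and last row numbers, one possible sketch is to expand the per-run flag back to one flag per row with rep and use it as a row filter. The name inStreak is made up for illustration; the test data is the same as in the update.

```r
# Same test data as in the update
d <- data.frame(Month = 1:20,
                Temp = c(rep(0, 6), 1, 0, rep(1, 4), 0, rep(1, 5), 0, 1))
x <- rle(d$Temp >= 0.5)
y <- x$lengths >= 5 & x$values
# Repeat each run's flag once per row in that run: one TRUE/FALSE per row of d
inStreak <- rep(y, times = x$lengths)
# Keep only the rows that belong to a qualifying run
d[inStreak, ]
```

This keeps rows 14 to 18 of the test data, and it also works unchanged when there are several qualifying runs.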