R - Filtering data in a data frame

Time: 2012-01-18 05:16:12

Tags: r filter dataframe

I'm new to R and really not sure how to filter data in a data frame.

I have created a data frame with two columns: monthly dates and the corresponding temperatures. It has 324 rows.

> head(Nino3.4_1974_2000)
  Month_common               Nino3.4_degree_1974_2000_plain
1   1974-01-15                       -1.93025
2   1974-02-15                       -1.73535
3   1974-03-15                       -1.20040
4   1974-04-15                       -1.00390
5   1974-05-15                       -0.62550
6   1974-06-15                       -0.36915

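The filtering rule is to select temperatures greater than or equal to 0.5 degrees. In addition, the condition must hold for at least 5 consecutive months.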

I have already eliminated the rows with temperatures below 0.5 degrees (see below).

for (i in 1) {
el_nino=Nino3.4_1974_2000[which(Nino3.4_1974_2000$Nino3.4_degree_1974_2000_plain >= 0.5),]
}

> head(el_nino)
   Month_common               Nino3.4_degree_1974_2000_plain
32   1976-08-15                      0.5192000
33   1976-09-15                      0.8740000
34   1976-10-15                      0.8864501
35   1976-11-15                      0.8229501
36   1976-12-15                      0.7336500
37   1977-01-15                      0.9276500

However, I still need to extract the stretches of at least 5 consecutive months. I hope someone can help me.

2 Answers:

Answer 0: (score: 4)

If you can always rely on the spacing being one month, then let's temporarily discard the time information:

temps <- Nino3.4_1974_2000$Nino3.4_degree_1974_2000_plain

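Since each temperature in that vector is always one month apart from the next, we just have to look for runs where temps[i] >= 0.5, and each run has to be at least 5 long.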
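That reasoning depends on the rows really being one month apart with no gaps. Below is a minimal sketch of a sanity check, assuming the date column is of class Date (or has been converted with as.Date). The `months` vector is made-up data standing in for Nino3.4_1974_2000$Month_common, and `month_index` and `is_regular` are illustrative names.

```r
# Made-up stand-in for Nino3.4_1974_2000$Month_common (assumed to be Dates)
months <- seq(as.Date("1974-01-15"), by = "month", length.out = 12)

# Turn each date into a running month count: year * 12 + month number
month_index <- as.integer(format(months, "%Y")) * 12L +
  as.integer(format(months, "%m"))

# TRUE only if every row is exactly one month after the previous row
is_regular <- all(diff(month_index) == 1L)
is_regular
```

If this comes back FALSE, a run of consecutive rows is not necessarily a run of consecutive months, and the gaps would need to be handled first.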

If we do the following:

ofinterest <- temps >= 0.5

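We will have a vector ofinterest with values TRUE FALSE FALSE TRUE TRUE ...., which is TRUE when temps[i] >= 0.5 and FALSE otherwise.

To rephrase your problem, then, we just need to look for occurrences of at least five TRUE in a row.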

To do this we can use the function rle. ?rle gives:

> ?rle
Description
     Compute the lengths and values of runs of equal values in a vector
     - or the reverse operation.
Value:
     'rle()' returns an object of class '"rle"' which is a list with components:
     lengths: an integer vector containing the length of each run.
     values: a vector of the same length as 'lengths' with the corresponding values.

So we use rle to count up all the streaks of consecutive TRUE in a row and consecutive FALSE in a row, and look for streaks of at least 5 TRUE in a row.

I'll just make up some data to demonstrate:

# for you, temps <- Nino3.4_1974_2000$Nino3.4_degree_1974_2000_plain
temps <- runif(1000)

# make a vector that is TRUE when temperature is >= 0.5 and FALSE otherwise
ofinterest <- temps >= 0.5

# count up the runs of TRUEs and FALSEs using rle:
runs <- rle(ofinterest)

# we need to find points where runs$lengths >= 5 (i.e. at least 5 in a row),
# AND runs$values is TRUE (so at least 5 TRUEs in a row).
streakIs <- which(runs$lengths>=5 & runs$values)

# these are all the el_nino occurrences.
# We need to convert `streakIs` into indices into our original `temps` vector.
# To do this we add up all the `runs$lengths` before run `streakIs[i]` and add 1;
# that gives the index into `temps` where the streak starts.
# that is:
# startMonths <- c()
# for ( n in streakIs ) {
#     startMonths <- c(startMonths, sum(runs$lengths[seq_len(n-1)]) + 1)
# }
#
# However, since this is R we can vectorise with sapply.
# seq_len(n-1) is used instead of 1:(n-1) because 1:0 is c(1, 0), which would
# give the wrong start index when the very first run is a streak.
startMonths <- sapply(streakIs, function(n) sum(runs$lengths[seq_len(n-1)])+1)

Now, if you do Nino3.4_1974_2000$Month_common[startMonths], you will get all the months in which an El Niño started.

It boils down to just a few lines:

runs <- rle(Nino3.4_1974_2000$Nino3.4_degree_1974_2000_plain>=0.5)
streakIs <- which(runs$lengths>=5 & runs$values)
startMonths <- sapply(streakIs, function(n) sum(runs$lengths[seq_len(n-1)])+1)
Nino3.4_1974_2000$Month_common[startMonths]
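The start months identify where each streak begins, but the question also asks for the filtered rows themselves. One possible way to get them, sketched here on made-up data, is to expand the per-run flags back to one flag per row with rep. The data frame `df`, its shortened column names, and the names `qualifies` and `keep` are assumptions for illustration, not part of the original answer.

```r
# Made-up data standing in for Nino3.4_1974_2000 (column names shortened)
df <- data.frame(
  Month = seq(as.Date("1974-01-15"), by = "month", length.out = 14),
  Temp  = c(0.6, 0.7, 0.1, 0.9, 0.8, 0.7, 0.6, 0.5, 0.2, 0.9, 0.9, 0.1, 0.3, 0.6)
)

runs <- rle(df$Temp >= 0.5)

# One flag per run: TRUE only for runs of at least 5 consecutive TRUEs
qualifies <- runs$lengths >= 5 & runs$values

# Repeat each run's flag by that run's length, giving one flag per row
keep <- rep(qualifies, runs$lengths)

# Rows that belong to a qualifying streak
el_nino <- df[keep, ]
el_nino
```

In this toy data only rows 4 to 8 survive: the shorter warm spells in rows 1 to 2, 10 to 11 and 14 are dropped because they last fewer than 5 months.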

Answer 1: (score: 1)

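Here is one way that uses the fact that the months are regular, always one month apart. The problem then reduces to finding runs of at least 5 consecutive rows with temps >= 0.5 degrees: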

# Some sample data
d <- data.frame(Month=1:20, Temp=c(rep(1,6),0,rep(1,4),0,rep(1,5),0, rep(1,2)))
d

# Use rle to find runs of temps >= 0.5 degrees
x <- rle(d$Temp >= 0.5)

# Then find the last row in each run of 5 or more
y <- x$lengths>=5 # BUG HERE: See update below!
lastRow <- cumsum(x$lengths)[y]

# Finally, deduce the first row and make a result matrix
firstRow <- lastRow - x$lengths[y] + 1L
res <- cbind(firstRow, lastRow) 
res
#     firstRow lastRow
#[1,]        1       6
#[2,]       13      17

Update: I had a bug that also detected runs of 5 or more values below 0.5. Here is the updated code (and test data):

d <- data.frame(Month=1:20, Temp=c(rep(0,6),1,0,rep(1,4),0,rep(1,5),0, 1))
x <- rle(d$Temp >= 0.5)
y <- x$lengths>=5 & x$values
lastRow <- cumsum(x$lengths)[y]
firstRow <- lastRow - x$lengths[y] + 1L
res <- cbind(firstRow, lastRow) 
res
#     firstRow lastRow
#[1,]       14      18
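The corrected steps can be wrapped into a small reusable function. This is only a sketch: the name `find_warm_runs` and its arguments `threshold` and `min_len` are hypothetical, and it uses which() so that any NA comparisons are dropped rather than propagated.

```r
# Hypothetical helper; the name and arguments are illustrative only
find_warm_runs <- function(temp, threshold = 0.5, min_len = 5L) {
  x <- rle(temp >= threshold)
  # Indices of runs that are long enough AND consist of TRUEs
  y <- which(x$lengths >= min_len & x$values)
  # The cumulative run lengths give the last row of each run
  lastRow <- cumsum(x$lengths)[y]
  firstRow <- lastRow - x$lengths[y] + 1L
  data.frame(firstRow = firstRow, lastRow = lastRow)
}

# Same test data as in the update above
temp <- c(rep(0, 6), 1, 0, rep(1, 4), 0, rep(1, 5), 0, 1)
res <- find_warm_runs(temp)
res
```

With the real data, the first and last rows could then be used to look up the matching dates, for example Nino3.4_1974_2000$Month_common[res$firstRow].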