Subsetting time series to get the start and end of each consecutive run in a list

Time: 2016-10-18 21:00:41

Tags: r time-series

It seems easy, but after a long time of searching and trying I still have not got it:

I have a list of time series. A short reproducible example:

a <- seq(as.Date("1970-01-01"), as.Date("1970-01-05"), "days")
b <- seq(as.Date("1985-10-01"), as.Date("1985-10-05"), "days")
c <- seq(as.Date("2014-03-01"), as.Date("2014-03-05"), "days")
d <- c(a, b, c)
df1 <- data.frame(d)
colnames(df1) <- c("date")
e <- seq(as.Date("1975-01-01"), as.Date("1975-01-05"), "days")
f <- seq(as.Date("1990-10-01"), as.Date("1990-10-05"), "days")
g <- c(e, f)
df2 <- data.frame(g)
colnames(df2) <- c("date")
ll <- list(df1, df2)

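Now I want to subset the data.frames in the list so that only the first and last date of each run of consecutive days remains: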

> llsubset
[[1]]
        date
1 1970-01-01
2 1970-01-05
3 1985-10-01
4 1985-10-05
5 2014-03-01
6 2014-03-05

[[2]]
        date
1 1975-01-01
2 1975-01-05
3 1990-10-01
4 1990-10-05

I tried it with rollapply, but it did not work and is not worth showing. Maybe you can help me? Thanks!

4 answers:

Answer 0 (score: 3)

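Determine which points are more than 1 day after the previous one, and from that construct a logical vector that is TRUE at both ends of each sequence and FALSE elsewhere. Subset by it. No packages are used.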

lapply(ll, subset, { dif <- diff(date) > 1; c(TRUE, dif) | c(dif, TRUE) } )

giving:

[[1]]
         date
1  1970-01-01
5  1970-01-05
6  1985-10-01
10 1985-10-05
11 2014-03-01
15 2014-03-05

[[2]]
         date
1  1975-01-01
5  1975-01-05
6  1990-10-01
10 1990-10-05
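
The one-liner above packs several steps into a single expression. A minimal sketch of the same idea, unrolled on a hypothetical toy vector (the names `dates`, `is_start`, `is_end` and `keep` are illustrative, not part of the answer):

```r
# Hypothetical toy input: two runs of consecutive days.
dates <- as.Date(c("1970-01-01", "1970-01-02", "1970-01-03",
                   "1985-10-01", "1985-10-02"))

# TRUE where the gap to the previous date is more than one day.
dif <- diff(dates) > 1

# A date starts a run if it is the first one or follows a gap;
# it ends a run if it is the last one or precedes a gap.
is_start <- c(TRUE, dif)
is_end   <- c(dif, TRUE)

# Keep only the dates that are a start or an end of a run.
keep <- is_start | is_end
dates[keep]
```

Because `diff()` returns one element fewer than its input, padding with `TRUE` on the left or on the right is what aligns the gap markers with "start" and "end" positions respectively.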

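Answer 1 (score: 1)

Maybe something like this? Use cumsum and diff to create a group variable, then subset the dates (assuming you want the minimum and maximum date within each consecutive period, and that date is already sorted in ascending order):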

library(dplyr)
lapply(ll, function(df) {
            df %>% 
                  group_by(cumsum(c(TRUE, diff(date) != 1))) %>% 
                  slice(c(1, n())) %>% 
                  ungroup() %>% 
                  select(date) }
      )

#[[1]]
# A tibble: 6 × 1
#        date
#      <date>
#1 1970-01-01
#2 1970-01-05
#3 1985-10-01
#4 1985-10-05
#5 2014-03-01
#6 2014-03-05

#[[2]]
# A tibble: 4 × 1
#        date
#      <date>
#1 1975-01-01
#2 1975-01-05
#3 1990-10-01
#4 1990-10-05
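
The grouping variable is the key step here, so it may help to look at it in isolation. The following is a base R sketch on hypothetical toy data (the names `dates`, `run_id` and `runs` are illustrative); it assumes the dates are sorted in ascending order, as the answer does:

```r
# Hypothetical toy input: two runs of consecutive days, sorted ascending.
dates <- as.Date(c("1970-01-01", "1970-01-02", "1970-01-03",
                   "1985-10-01", "1985-10-02"))

# A new run starts at the first date and wherever the step is not exactly 1 day;
# cumsum() turns those markers into a run counter (1, 1, 1, 2, 2 here).
run_id <- cumsum(c(TRUE, diff(dates) != 1))

# First and last date of each run: the first and last occurrence of each run_id.
runs <- data.frame(
  start = dates[!duplicated(run_id)],
  end   = dates[!duplicated(run_id, fromLast = TRUE)]
)
runs
```

This returns one row per run with separate start and end columns, which is a different shape from the single date column requested in the question.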

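Answer 2 (score: 0)

There may be a package that does exactly this, but I do not know its name yet.

Using diff() on the dates shows which neighbouring dates are only one day apart, like this: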

diff(df1$date)
Time differences in days
 [1]     1     1     1     1  5748     1     1     1     1 10374     1
[12]     1     1     1

We can use that.

end_finder <- function(x) {
  # find the gap between dates.
  # mark dates where the diff > 1,
  # also mark the entry prior to that one;
  # this will be the end of the previous date.
  # also include the first and last element.

  # the column is named `date` (x$dates would be NULL)
  diff_dates <- c(100, diff(x$date))
  diff_idx <- which(diff_dates > 1)
  diff_idx <- c((diff_idx - 1), diff_idx)
  # remove any elements < 1
  diff_idx <- diff_idx[diff_idx >= 1]
  # include the first element
  diff_idx <- c(1, diff_idx)
  # include the last element
  diff_idx <- c(diff_idx, length(x$date))
  # remove duplicates and sort for easier reading
  diff_idx <- sort(unique(diff_idx))
  x$date[diff_idx]
}

Now run it.

> lapply(ll, end_finder)
[[1]]
[1] "1970-01-01" "1970-01-05" "1985-10-01" "1985-10-05" "2014-03-01"
[6] "2014-03-05"

[[2]]
[1] "1975-01-01" "1975-01-05" "1990-10-01" "1990-10-05"

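Answer 3 (score: 0)

Another solution using dplyr: first we compute the year of each date, then for each year we find the minimum and maximum date, using the year and melt functions from the lubridate and reshape2 packages respectively.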

library(dplyr)
library(lubridate)
library(reshape2)

ll <- list(df1, df2)


fn_endPoint_Years = function(DF) {
  newDF = DF %>%
    mutate(Year = year(date)) %>%
    group_by(Year) %>%
    do(., data.frame(minDate = min(.$date), maxDate = max(.$date))) %>%
    melt(id = "Year", value.name = "date") %>%
    arrange(date) %>%
    select(date)
}

lapply(ll,fn_endPoint_Years)

# [[1]]
        # date
# 1 1970-01-01
# 2 1970-01-05
# 3 1985-10-01
# 4 1985-10-05
# 5 2014-03-01
# 6 2014-03-05

# [[2]]
        # date
# 1 1975-01-01
# 2 1975-01-05
# 3 1990-10-01
# 4 1990-10-05
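
Grouping by calendar year reproduces the expected output here because each run in the example falls in its own year. The base R sketch below uses hypothetical data with two runs in the same year to compare year-based grouping with gap-based grouping (the names `yr`, `run_id`, `by_year` and `by_gap` are illustrative, and the dates are assumed to be sorted ascending):

```r
# Hypothetical input: two separate runs, both in 2014.
dates <- c(seq(as.Date("2014-03-01"), as.Date("2014-03-03"), "days"),
           seq(as.Date("2014-07-10"), as.Date("2014-07-12"), "days"))

# Year-based grouping: both runs share the key "2014",
# so only one first/last pair is found.
yr <- format(dates, "%Y")
by_year <- sort(c(dates[!duplicated(yr)],
                  dates[!duplicated(yr, fromLast = TRUE)]))

# Gap-based grouping: a new group starts after every gap of more than one day.
run_id <- cumsum(c(TRUE, diff(dates) > 1))
by_gap <- sort(c(dates[!duplicated(run_id)],
                 dates[!duplicated(run_id, fromLast = TRUE)]))

by_year
by_gap
```

With this input the year-based variant yields two dates and the gap-based variant yields four, so the year approach fits only data where runs and calendar years coincide.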