Rolling window in dplyr?

Time: 2018-07-24 14:29:41

Tags: r dplyr rolling-computation

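I have a dataset of wind speeds recorded at 10-minute intervals. I want to group the data by month and hour, and then flag the data as follows:

A data point is flagged if its wind speed is below 10 m/s and it lies in a time window together with 17 other consecutive data points that are also below 10 m/s.

This amounts to a rolling time window of 18 data points (3 hours of continuous measurement).

In the figure below, all data points to the left of 1 are below 10, and they sit in a time window with 17 other consecutive points below 10, which is why they are all marked with a yellow flag.

The data points marked 2 are below 10, but they do not fall in a time window with 17 other consecutive measurements below 10 m/s, so they are not flagged.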

[Figure: wind speed time series with regions 1 and 2 marked (image not available)]
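The rule described above can be sketched in base R on a toy series. This is only an illustration under stated assumptions: `flag_calm` and its arguments are hypothetical names that do not come from the original post, the input is assumed to have no missing values, and the toy example uses a window of 3 instead of 18 to keep it short.

```r
# Hypothetical helper (not from the original post): flag every point that lies
# inside at least one run of `window` consecutive readings below `threshold`.
# Assumes `speed` contains no missing values.
flag_calm <- function(speed, window = 18, threshold = 10) {
  below <- speed < threshold
  n <- length(below)
  flagged <- logical(n)
  if (n < window) return(flagged)
  # Count of below-threshold readings in each window ending at position `ends`
  csum <- cumsum(c(0, below))
  ends <- window:n
  full <- (csum[ends + 1] - csum[ends - window + 1]) == window
  # Every point inside a fully calm window gets the flag
  for (e in ends[full]) flagged[(e - window + 1):e] <- TRUE
  flagged
}

# Toy series with window = 3: the run of four calm readings is flagged,
# the isolated pair at the end is not.
speed <- c(8, 9, 7, 6, 12, 11, 9, 8)
flag_calm(speed, window = 3)
# [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
```

With the 10-minute data from the question, the default `window = 18` would correspond to the 3-hour rule.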

My sample dataset is:

t= structure(list(TimeStamp = structure(c(1362047400, 1362048000, 
    1362048600, 1362049200, 1362049800, 1362050400, 1362051000, 1362051600, 
    1362052200, 1362052800, 1362053400, 1362054000, 1362054600, 1362055200, 
    1362055800, 1362056400, 1362057000, 1362057600, 1362058200, 1362058800, 
    1362059400, 1362060000, 1362060600, 1362061200, 1362061800, 1362062400, 
    1362063000, 1362063600, 1362064200, 1362064800, 1362065400, 1362066000, 
    1362066600, 1362067200, 1362067800, 1362068400, 1362069000, 1362069600, 
    1362070200, 1362070800, 1362071400, 1362072000, 1362072600, 1362073200, 
    1362073800, 1362074400, 1362075000, 1362075600, 1362076200, 1362076800, 
    1362077400, 1362078000, 1362078600, 1362079200, 1362079800, 1362080400, 
    1362081000, 1362081600, 1362082200, 1362082800, 1362083400, 1362084000, 
    1362084600, 1362085200, 1362085800, 1362086400, 1362087000, 1362087600, 
    1362088200, 1362088800, 1362089400, 1362090000, 1362090600, 1362091200, 
    1362091800, 1362092400, 1362093000, 1362093600, 1362094200, 1362094800, 
    1362095400, 1362096000, 1362096600, 1362097200, 1362097800, 1362098400, 
    1362099000, 1362099600, 1362100200, 1362100800, 1362101400, 1362102000, 
    1362102600, 1362103200, 1362103800, 1362104400, 1362105000, 1362105600, 
    1362106200, 1362106800, 1362107400, 1362108000, 1362108600, 1362109200, 
    1362109800, 1362110400, 1362111000, 1362111600, 1362112200, 1362112800, 
    1362113400, 1362114000, 1362114600, 1362115200, 1362115800, 1362116400, 
    1362117000, 1362117600, 1362118200, 1362118800, 1362119400, 1362120000, 
    1362120600, 1362121200, 1362121800, 1362122400, 1362123000, 1362123600, 
    1362124200, 1362124800, 1362125400, 1362126000, 1362126600, 1362127200, 
    1362127800, 1362128400, 1362129000, 1362129600, 1362130200, 1362130800, 
    1362131400, 1362132000, 1362132600, 1362133200, 1362133800, 1362134400, 
    1362135000, 1362135600, 1362136200, 1362136800, 1362137400, 1362138000, 
    1362138600, 1362139200, 1362139800, 1362140400, 1362141000, 1362141600, 
    1362142200, 1362142800, 1362143400, 1362144000, 1362144600, 1362145200, 
    1362145800, 1362146400, 1362147000, 1362147600, 1362148200, 1362148800, 
    1362149400, 1362150000, 1362150600, 1362151200, 1362151800, 1362152400, 
    1362153000, 1362153600, 1362154200, 1362154800, 1362155400, 1362156000, 
    1362156600, 1362157200, 1362157800, 1362158400, 1362159000, 1362159600, 
    1362160200, 1362160800, 1362161400, 1362162000, 1362162600, 1362163200, 
    1362163800, 1362164400, 1362165000, 1362165600, 1362166200, 1362166800, 
    1362167400), class = c("POSIXct", "POSIXt"), tzone = "GMT"), 
        MeanWindSpeed = c(7.7, 7.6, 6.7, 7.4, 6.6, 6.8, 6.9, 7.3, 
        7.4, 7.8, 7.7, 7.4, 6.5, 6.1, 5.6, 5, 5.8, 6.7, 6.2, 6.6, 
        6.1, 6.4, 5.8, 6.6, 5.9, 6.8, 6.6, 7.1, 7.5, 8, 7.2, 8, 7.2, 
        8.1, 7.7, 7.3, 7.3, 8.1, 7.6, 8.7, 8.1, 9, 8.6, 8.8, 8.8, 
        8.7, 9.1, 9.2, 9.4, 9.8, 9.7, 9.6, 9.7, 10.2, 10.8, 10.9, 
        11.1, 11.6, 11.8, 12.2, 12.5, 12.8, 12.5, 12.3, 11.8, 11.7, 
        11.5, 11.7, 12.1, 12.3, 12.3, 12.9, 13.1, 13.1, 12.6, 12.5, 
        12.6, 12.7, 12.4, 12.3, 12.1, 12.6, 13, 12.7, 13.4, 13.8, 
        13.7, 13.9, 13.8, 13.7, 13.6, 13.7, 13.4, 12.9, 13, 12.6, 
        12.3, 12.3, 12.5, 12.6, 12.9, 12.9, 12.9, 12.9, 12.8, 12.7, 
        12.6, 12.5, 12.6, 12.9, 12.9, 12.8, 12.7, 12.6, 12.8, 12.7, 
        12.6, 12.2, 11.8, 11.4, 11.8, 12.2, 11.7, 11.4, 11.9, 11.3, 
        11.3, 11.1, 11.3, 11.5, 10.6, 9.4, 9.1, 8.5, 8.2, 8, 8, 8.6, 
        8.7, 8.5, 8.4, 8.5, 8.4, 8.5, 7.8, 7.2, 7.3, 7.4, 8.1, 7.9, 
        7.4, 7.2, 7, 6.6, 6.7, 6.7, 6.8, 6.6, 5.9, 5.3, 5.6, 5.9, 
        5.3, 4.6, 3.7, 3.8, 3.7, 3.3, 3.7, 1.9, 2.4, 4.5, 4.6, 3, 
        4.7, 3.9, 3.3, 3.4, 2.9, 4.5, 5.2, 4.3, 4.7, 5.3, 5.3, 5.2, 
        5.7, 4.7, 4.9, 5.3, 5.3, 4.7, 5, 4.7, 6.1, 6.2, 6.6, 6.8, 
        8.4, 9.3, 9.5)), .Names = c("TimeStamp", "MeanWindSpeed"), row.names = 2800:3000, class = "data.frame")

Using the dplyr package, I binned the wind speed into values below 10 m/s and values above 10 m/s:

library(dplyr)
library(lubridate)  # provides hour() and month()

test = t %>%
     dplyr::mutate(H = hour(TimeStamp) )%>%
     dplyr::mutate(M = month(TimeStamp))%>%
     dplyr::group_by(M,H)%>%
     mutate(wsbin = cut(MeanWindSpeed, breaks = c(0,10,30), labels = c(0,1)))
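A note on the `cut()` call above: by default `cut()` uses right-closed intervals, so the bins are (0,10] and (10,30]. A reading of exactly 10 therefore lands in bin 0, and a reading of exactly 0 becomes `NA`. The sketch below uses invented toy values to show this, along with one possible alternative if a strict "below 10" bin is wanted.

```r
# Toy values chosen to probe the bin edges (not from the dataset)
x <- c(0, 5, 10, 10.2, 30)

# Same call as in the question: intervals are (0,10] and (10,30]
default_bins <- cut(x, breaks = c(0, 10, 30), labels = c(0, 1))
as.character(default_bins)
# [1] NA  "0" "0" "1" "1"

# One possible alternative: left-closed intervals [0,10) and [10,30]
strict_bins <- cut(x, breaks = c(0, 10, 30), labels = c(0, 1),
                   right = FALSE, include.lowest = TRUE)
as.character(strict_bins)
# [1] "0" "0" "1" "1" "1"
```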

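I now have a column named wsbin containing 0/1 values. How can I define a rolling window that tells me which data points with wsbin = 0 are surrounded by 17 other consecutive data points with wind speed below 10 m/s?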

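In the end, I would like to obtain the following table:

[Image: desired output table (not available)]

It shows, for each hour of each month, the ratio of flagged data to total data.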
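As a rough sketch of how such a ratio table could be built once a flag column exists, the base R example below uses invented toy data. The timestamps and the `flagged` vector are assumptions for illustration and are not taken from the sample dataset.

```r
# Toy data (invented for illustration): timestamps every 10 minutes and a
# logical `flagged` vector standing in for the result of the window rule.
ts <- seq(as.POSIXct("2013-02-28 22:00:00", tz = "GMT"),
          by = "10 min", length.out = 24)
flagged <- rep(c(TRUE, FALSE), times = c(9, 15))

month <- format(ts, "%m", tz = "GMT")
hour  <- as.integer(format(ts, "%H", tz = "GMT"))

# The mean of a logical vector is the share of TRUE values, so this gives
# flagged / total for each (hour, month) cell; empty cells are NA.
ratio <- tapply(flagged, list(hour = hour, month = month), mean)
ratio
```

Here hour 22 of month 02 comes out as 1 (all six readings flagged), hour 23 as 0.5, and the two hours in month 03 as 0.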

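2 Answers:

Answer 0: (score: 1)

If you have complete data, the tibbletime package makes this fairly easy. Note that this approach assumes your dataset has some data for every hour. If it does not, you will need to impute the missing values before using this method.

I have tried to comment the code itself so that it is fairly self-explanatory.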

library(dplyr)      # %>%, group_by(), summarise(), mutate()
library(tidyr)      # spread()
library(tibbletime)
library(lubridate)

# Turn the sample data into a tibbletime object
tbl <- as_tbl_time(t,TimeStamp)

# Create a function that outputs TRUE if the entire input is less than 10
under10mps <- function(in.vec){
  max(in.vec) < 10
}

# Use the tibbletime package to create a function that rolls on a 3 row window
under10mps3hr <- rollify(under10mps,window = 3)

# Use time based grouping to aggregate times to hourly
tbl2 <- tbl %>%
  # Because rollify works on a 3 row window, we need each hour to be one and only one row.
  collapse_by("hourly",side = "start",clean = TRUE) %>%
  group_by(TimeStamp) %>%
  # Use the max wind speed to condense the 10-minute data to hourly. Other options are possible.
  summarise(MeanWindSpeed = max(MeanWindSpeed)) %>%
  ungroup() %>%
  mutate(under10for3 = under10mps3hr(MeanWindSpeed)) %>%
  mutate(month = month(TimeStamp,label = TRUE,abbr = TRUE),
         hour = hour(TimeStamp)) %>%
  group_by(month,hour) %>%
  summarise(prob = sum(under10for3)/length(under10for3)) %>%
  ungroup() %>%
  spread(month,prob)

# # A tibble: 24 x 3
#     hour   Feb   Mar
#     <int> <dbl> <dbl>
# 1     0    NA     0
# 2     1    NA     0
# 3     2    NA     0
# 4     3    NA     0
# 5     4    NA     0
# 6     5    NA     0
# 7     6    NA     0
# 8     7    NA     0
# 9     8    NA     0
# 10    9    NA     0
# # ... with 14 more rows

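Note: the example data has only one day of data for each hour in each month, so with a single observation per hour the probability can only be 0 or 1 (the event either happened once or did not happen). If you use a full month of data, you should get the full range of probabilities.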

Answer 1: (score: 0)

You can use a loop to identify sequences of three or more consecutive hours below 10 m/s, then group by month and take the percentage:

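# The data are at 10-minute intervals, so 3 hours = 18 consecutive rows
window <- 18
# wsbin is a factor: "0" = at or below 10 m/s, "1" = above 10 m/s
ws <- as.character(test$wsbin)
counter <- 0
three_or_more <- numeric(nrow(test))
for (i in seq_len(nrow(test))) {

  # Extend the current run of low-wind rows, or reset it (NA also resets)
  if (isTRUE(ws[i] == "0")) { counter <- counter + 1 } else { counter <- 0 }

  # Once the run is long enough, flag the whole trailing window
  if (counter >= window) {
    three_or_more[(i - window + 1):i] <- 1
  }

}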

test$three_or_more <- three_or_more
test %>% as.data.frame()

# And to get a percent by month:

test %>% group_by(M) %>% summarise(percent_per_month = sum(three_or_more) / n())
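The run-counting idea behind the loop above can also be written without a loop using base R's `rle()`. The following is a minimal sketch, not part of the original answer: `in_long_run` is a hypothetical helper name, and the input is assumed to be a logical vector without missing values.

```r
# Hypothetical helper: TRUE where a value belongs to a run of at least
# `min_len` consecutive TRUEs in `below` (assumes no NA values).
in_long_run <- function(below, min_len) {
  runs <- rle(below)
  # Keep only runs of TRUE that are long enough
  keep <- runs$values & runs$lengths >= min_len
  # Expand the per-run decision back to one value per element
  rep(keep, times = runs$lengths)
}

below <- c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE)
in_long_run(below, min_len = 3)
# [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
```

Applied to the question's data, something like `in_long_run(test$MeanWindSpeed < 10, 18)` would flag rows that sit in a run of at least 18 consecutive readings below 10 m/s.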