找到每年连续六天的第一组的第一天并协调?

时间:2017-07-11 12:59:41

标签: r count statistics

我是R的新手,需要一些建议来解决以下问题:我在数据框中有一个.dbf表,其中某个阈值在不同的空间位置上传递,由&#表示34;点名"在那张桌子里。它看起来像这样:

time    PointID
04/07/71    X10Y11
04/25/71    X10Y11
04/26/71    X10Y11
05/02/71    X10Y11
05/03/71    X10Y11
05/04/71    X10Y11
05/05/71    X10Y11
05/09/71    X10Y11
05/12/71    X10Y11
05/13/71    X10Y11
05/14/71    X10Y11
05/15/71    X10Y11
05/16/71    X10Y11
05/17/71    X10Y11
05/18/71    X10Y11
05/19/71    X10Y11
05/20/71    X10Y11
05/21/71    X10Y11
05/22/71    X10Y11
05/23/71    X10Y11
05/26/71    X10Y11
10/07/71    X10Y11
10/08/71    X10Y11
10/09/71    X10Y11
10/10/71    X10Y11
10/11/71    X10Y11
10/12/71    X10Y11
10/23/71    X10Y11
10/24/71    X10Y11
10/25/71    X10Y11
10/26/71    X10Y11
10/27/71    X10Y11
10/28/71    X10Y11
11/04/71    X10Y11
03/30/72    X10Y11
04/07/72    X10Y11
04/08/72    X10Y11
04/10/72    X10Y11
04/20/72    X10Y11
04/22/72    X10Y11
04/23/72    X10Y11
04/24/72    X10Y11
04/25/72    X10Y11
04/26/72    X10Y11
04/27/72    X10Y11
04/28/72    X10Y11
04/29/72    X10Y11
04/30/72    X10Y11
05/01/72    X10Y11
05/02/72    X10Y11
05/03/72    X10Y11
05/08/72    X10Y11
05/09/72    X10Y11
05/10/72    X10Y11
05/11/72    X10Y11
10/09/72    X10Y11
10/10/72    X10Y11
10/11/72    X10Y11
10/12/72    X10Y11
10/13/72    X10Y11
10/14/72    X10Y11
10/15/72    X10Y11
10/16/72    X10Y11
10/17/72    X10Y11
10/18/72    X10Y11
10/19/72    X10Y11
01/15/73    X10Y11
01/21/73    X10Y11
03/19/73    X10Y11
03/20/73    X10Y11
03/21/73    X10Y11
03/31/73    X10Y11
04/01/73    X10Y11
04/02/73    X10Y11
04/03/73    X10Y11
04/15/73    X10Y11
03/01/71    X10Y12
04/04/71    X10Y12
04/07/71    X10Y12
04/08/71    X10Y12
04/09/71    X10Y12
04/10/71    X10Y12
04/11/71    X10Y12
04/18/71    X10Y12
04/19/71    X10Y12
04/20/71    X10Y12
04/21/71    X10Y12
04/22/71    X10Y12
04/23/71    X10Y12
04/25/71    X10Y12
04/26/71    X10Y12
04/28/71    X10Y12
05/02/71    X10Y12
05/03/71    X10Y12
05/04/71    X10Y12
05/05/71    X10Y12
05/06/71    X10Y12
05/07/71    X10Y12
05/08/71    X10Y12
05/09/71    X10Y12
05/10/71    X10Y12
07/08/71    X10Y12
07/09/71    X10Y12
07/10/71    X10Y12
07/11/71    X10Y12
07/12/71    X10Y12
11/02/71    X10Y12
11/03/71    X10Y12
11/04/71    X10Y12
02/10/72    X10Y12
02/11/72    X10Y12
03/30/72    X10Y12
04/05/72    X10Y12
04/06/72    X10Y12
04/07/72    X10Y12
04/08/72    X10Y12
04/10/72    X10Y12
04/23/72    X10Y12
04/24/72    X10Y12
04/25/72    X10Y12
04/26/72    X10Y12
04/27/72    X10Y12
04/28/72    X10Y12
04/29/72    X10Y12
04/30/72    X10Y12
05/01/72    X10Y12
05/02/72    X10Y12
05/03/72    X10Y12
05/04/72    X10Y12
05/07/72    X10Y12
05/08/72    X10Y12
05/09/72    X10Y12
05/10/72    X10Y12
05/11/72    X10Y12
05/12/72    X10Y12
05/13/72    X10Y12
05/14/72    X10Y12
05/15/72    X10Y12
05/16/72    X10Y12
05/17/72    X10Y12
08/30/72    X10Y12
08/31/72    X10Y12
09/01/72    X10Y12
09/02/72    X10Y12
09/03/72    X10Y12
09/04/72    X10Y12
09/05/72    X10Y12
09/06/72    X10Y12

现在我正在寻找一种方法来找到每年连续六天和PointID的第一组的第一天。理想情况下,上表的结果看起来像这样,只剩下日期和PointID:

time    PointID
05/12/71    X10Y11
04/22/72    X10Y11
04/18/71    X10Y12
04/23/72    X10Y12

该解决方案应该适用于每个表超过7百万行的大型数据集。有谁知道这个问题的解决方案,可以帮助我吗?

谢谢!

编辑:变量如下

'data.frame':   21071 obs. of  2 variables:
 $ time   : Date, format: "1971-03-01" "1971-04-04" "1971-04-04" "1971-04-04" ...
 $ PointID: Factor w/ 5 levels "X10Y11","X10Y12",..: 2 2 3 4 5 1 2 3 5 2 ...

2 个答案:

答案 0 :(得分:1)

我不知道是否足够有效,但这是基础R的可能解决方案:

DF$year <- as.integer(format(DF$time,format='%Y'))

findFirstConsecutiveSixDays <- function(dates){
  dates <- sort(dates)
  RLE <- rle(as.numeric(dates[-length(dates)] - dates[-1]))
  groupOfSixConsec <- which(RLE$values == -1 & RLE$lengths >= 5)
  if(length(groupOfSixConsec) == 0)
    return(as.Date(NA))
  D <- dates[sum(RLE$lengths[1:groupOfSixConsec[1]])-RLE$lengths[groupOfSixConsec[1]]+1]
  return(D)
}

Grouped <- aggregate(time ~ year + PointID, DF, FUN=findFirstConsecutiveSixDays)

> Grouped[complete.cases(Grouped),c('time','PointID')]
        time PointID
1 1971-05-12  X10Y11
2 1972-04-22  X10Y11
4 1971-04-18  X10Y12
5 1972-04-23  X10Y12

复制DF的代码:

DF <- structure(list(time = structure(c(461, 479, 480, 486, 487, 488, 
489, 493, 496, 497, 498, 499, 500, 501, 502, 503, 504, 505, 506, 
507, 510, 644, 645, 646, 647, 648, 649, 660, 661, 662, 663, 664, 
665, 672, 819, 827, 828, 830, 840, 842, 843, 844, 845, 846, 847, 
848, 849, 850, 851, 852, 853, 858, 859, 860, 861, 1012, 1013, 
1014, 1015, 1016, 1017, 1018, 1019, 1020, 1021, 1022, 1110, 1116, 
1173, 1174, 1175, 1185, 1186, 1187, 1188, 1200, 424, 458, 461, 
462, 463, 464, 465, 472, 473, 474, 475, 476, 477, 479, 480, 482, 
486, 487, 488, 489, 490, 491, 492, 493, 494, 553, 554, 555, 556, 
557, 670, 671, 672, 770, 771, 819, 825, 826, 827, 828, 830, 843, 
844, 845, 846, 847, 848, 849, 850, 851, 852, 853, 854, 857, 858, 
859, 860, 861, 862, 863, 864, 865, 866, 867, 972, 973, 974, 975, 
976, 977, 978, 979), class = "Date"), PointID = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L), .Label = c("X10Y11", "X10Y12"), class = "factor")), .Names = c("time", 
"PointID"), row.names = c(NA, -148L), class = "data.frame")

答案 1 :(得分:0)

以下是使用tidyverse

的想法
library(tidyverse)

df %>% 
  group_by(PointID) %>% 
  mutate(new = c(NA, diff.difftime(time)), new1 = data.table::rleid(new)) %>% 
  filter(new == 1) %>% 
  group_by(PointID, new1) %>% 
  summarise(cnt = n(), time = first(time)) %>% 
  filter(cnt >= 5) %>% 
  mutate(time = time - 1) %>% 
  group_by(PointID, time1 = format(time, format = '%Y')) %>% 
  slice(1L) %>% 
  ungroup() %>% 
  select(-c(new1, cnt, time1))

# A tibble: 4 x 2
#  PointID       time
#   <fctr>     <date>
#1  X10Y11 1971-05-12
#2  X10Y11 1972-04-22
#3  X10Y12 1971-04-18
#4  X10Y12 1972-04-23