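I'm new to R and need some advice on the following problem: I have a .dbf table loaded into a data frame, recording the dates on which a certain threshold is passed at different spatial locations. The locations are identified by the "PointID" column in that table. It looks like this: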
time PointID
04/07/71 X10Y11
04/25/71 X10Y11
04/26/71 X10Y11
05/02/71 X10Y11
05/03/71 X10Y11
05/04/71 X10Y11
05/05/71 X10Y11
05/09/71 X10Y11
05/12/71 X10Y11
05/13/71 X10Y11
05/14/71 X10Y11
05/15/71 X10Y11
05/16/71 X10Y11
05/17/71 X10Y11
05/18/71 X10Y11
05/19/71 X10Y11
05/20/71 X10Y11
05/21/71 X10Y11
05/22/71 X10Y11
05/23/71 X10Y11
05/26/71 X10Y11
10/07/71 X10Y11
10/08/71 X10Y11
10/09/71 X10Y11
10/10/71 X10Y11
10/11/71 X10Y11
10/12/71 X10Y11
10/23/71 X10Y11
10/24/71 X10Y11
10/25/71 X10Y11
10/26/71 X10Y11
10/27/71 X10Y11
10/28/71 X10Y11
11/04/71 X10Y11
03/30/72 X10Y11
04/07/72 X10Y11
04/08/72 X10Y11
04/10/72 X10Y11
04/20/72 X10Y11
04/22/72 X10Y11
04/23/72 X10Y11
04/24/72 X10Y11
04/25/72 X10Y11
04/26/72 X10Y11
04/27/72 X10Y11
04/28/72 X10Y11
04/29/72 X10Y11
04/30/72 X10Y11
05/01/72 X10Y11
05/02/72 X10Y11
05/03/72 X10Y11
05/08/72 X10Y11
05/09/72 X10Y11
05/10/72 X10Y11
05/11/72 X10Y11
10/09/72 X10Y11
10/10/72 X10Y11
10/11/72 X10Y11
10/12/72 X10Y11
10/13/72 X10Y11
10/14/72 X10Y11
10/15/72 X10Y11
10/16/72 X10Y11
10/17/72 X10Y11
10/18/72 X10Y11
10/19/72 X10Y11
01/15/73 X10Y11
01/21/73 X10Y11
03/19/73 X10Y11
03/20/73 X10Y11
03/21/73 X10Y11
03/31/73 X10Y11
04/01/73 X10Y11
04/02/73 X10Y11
04/03/73 X10Y11
04/15/73 X10Y11
03/01/71 X10Y12
04/04/71 X10Y12
04/07/71 X10Y12
04/08/71 X10Y12
04/09/71 X10Y12
04/10/71 X10Y12
04/11/71 X10Y12
04/18/71 X10Y12
04/19/71 X10Y12
04/20/71 X10Y12
04/21/71 X10Y12
04/22/71 X10Y12
04/23/71 X10Y12
04/25/71 X10Y12
04/26/71 X10Y12
04/28/71 X10Y12
05/02/71 X10Y12
05/03/71 X10Y12
05/04/71 X10Y12
05/05/71 X10Y12
05/06/71 X10Y12
05/07/71 X10Y12
05/08/71 X10Y12
05/09/71 X10Y12
05/10/71 X10Y12
07/08/71 X10Y12
07/09/71 X10Y12
07/10/71 X10Y12
07/11/71 X10Y12
07/12/71 X10Y12
11/02/71 X10Y12
11/03/71 X10Y12
11/04/71 X10Y12
02/10/72 X10Y12
02/11/72 X10Y12
03/30/72 X10Y12
04/05/72 X10Y12
04/06/72 X10Y12
04/07/72 X10Y12
04/08/72 X10Y12
04/10/72 X10Y12
04/23/72 X10Y12
04/24/72 X10Y12
04/25/72 X10Y12
04/26/72 X10Y12
04/27/72 X10Y12
04/28/72 X10Y12
04/29/72 X10Y12
04/30/72 X10Y12
05/01/72 X10Y12
05/02/72 X10Y12
05/03/72 X10Y12
05/04/72 X10Y12
05/07/72 X10Y12
05/08/72 X10Y12
05/09/72 X10Y12
05/10/72 X10Y12
05/11/72 X10Y12
05/12/72 X10Y12
05/13/72 X10Y12
05/14/72 X10Y12
05/15/72 X10Y12
05/16/72 X10Y12
05/17/72 X10Y12
08/30/72 X10Y12
08/31/72 X10Y12
09/01/72 X10Y12
09/02/72 X10Y12
09/03/72 X10Y12
09/04/72 X10Y12
09/05/72 X10Y12
09/06/72 X10Y12
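Now I'm looking for a way to find, for each year and PointID, the first day of the first group of six consecutive days. Ideally, the result for the table above would look like this, with only the date and PointID remaining: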
time PointID
05/12/71 X10Y11
04/22/72 X10Y11
04/18/71 X10Y12
04/23/72 X10Y12
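The solution should work for large data sets with more than 7 million rows per table. Does anyone know a solution to this problem and can help me?
Thanks!
Edit: the variables are as follows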
'data.frame': 21071 obs. of 2 variables:
$ time : Date, format: "1971-03-01" "1971-04-04" "1971-04-04" "1971-04-04" ...
$ PointID: Factor w/ 5 levels "X10Y11","X10Y12",..: 2 2 3 4 5 1 2 3 5 2 ...
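The table above displays dates as mm/dd/yy, while str() reports a Date column. If the dates arrive as text, they would need converting first. A minimal sketch, assuming the mm/dd/yy layout shown above (the vector `raw` is made up for illustration):

```r
# Text dates in the mm/dd/yy layout shown in the table (made-up input)
raw <- c("04/07/71", "04/25/71", "03/30/72")

# On input, %y maps two-digit years 69-99 to 19xx and 00-68 to 20xx
parsed <- as.Date(raw, format = "%m/%d/%y")
print(parsed)

# Year of each date, usable later as a grouping variable
yr <- as.integer(format(parsed, "%Y"))
print(yr)
```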
Answer 0: (score: 1)
I don't know whether it is efficient enough, but here is a possible solution in base R:
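# Year of each observation, used as a grouping variable
DF$year <- as.integer(format(DF$time, format = '%Y'))

findFirstConsecutiveSixDays <- function(dates){
  dates <- sort(dates)
  # Differences between neighbouring dates; -1 means the next date is the following day
  RLE <- rle(as.numeric(dates[-length(dates)] - dates[-1]))
  # Six consecutive days correspond to a run of at least five -1 differences
  groupOfSixConsec <- which(RLE$values == -1 & RLE$lengths >= 5)
  if(length(groupOfSixConsec) == 0)
    return(as.Date(NA))
  # Position in dates where the first qualifying run starts
  D <- dates[sum(RLE$lengths[1:groupOfSixConsec[1]]) - RLE$lengths[groupOfSixConsec[1]] + 1]
  return(D)
}

# Apply the function to every combination of year and PointID
Grouped <- aggregate(time ~ year + PointID, DF, FUN = findFirstConsecutiveSixDays)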
> Grouped[complete.cases(Grouped),c('time','PointID')]
time PointID
1 1971-05-12 X10Y11
2 1972-04-22 X10Y11
4 1971-04-18 X10Y12
5 1972-04-23 X10Y12
Code to reproduce DF:
DF <- structure(list(time = structure(c(461, 479, 480, 486, 487, 488,
489, 493, 496, 497, 498, 499, 500, 501, 502, 503, 504, 505, 506,
507, 510, 644, 645, 646, 647, 648, 649, 660, 661, 662, 663, 664,
665, 672, 819, 827, 828, 830, 840, 842, 843, 844, 845, 846, 847,
848, 849, 850, 851, 852, 853, 858, 859, 860, 861, 1012, 1013,
1014, 1015, 1016, 1017, 1018, 1019, 1020, 1021, 1022, 1110, 1116,
1173, 1174, 1175, 1185, 1186, 1187, 1188, 1200, 424, 458, 461,
462, 463, 464, 465, 472, 473, 474, 475, 476, 477, 479, 480, 482,
486, 487, 488, 489, 490, 491, 492, 493, 494, 553, 554, 555, 556,
557, 670, 671, 672, 770, 771, 819, 825, 826, 827, 828, 830, 843,
844, 845, 846, 847, 848, 849, 850, 851, 852, 853, 854, 857, 858,
859, 860, 861, 862, 863, 864, 865, 866, 867, 972, 973, 974, 975,
976, 977, 978, 979), class = "Date"), PointID = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L), .Label = c("X10Y11", "X10Y12"), class = "factor")), .Names = c("time",
"PointID"), row.names = c(NA, -148L), class = "data.frame")
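The run-length step in the function above may be easier to follow on a handful of dates. The following is only an illustrative sketch on made-up dates, not part of the answer's code. It uses diff(), so consecutive days show up as +1 rather than the -1 used in the function.

```r
# Made-up dates: a run of 3 consecutive days, a gap, then a run of 6
dates <- as.Date("1971-05-01") + c(0, 1, 2, 10, 11, 12, 13, 14, 15)

# Differences between neighbouring dates; 1 means "the next calendar day"
gaps <- as.numeric(diff(dates))   # 1 1 8 1 1 1 1 1

# Run-length encode the gaps: six consecutive days appear as
# a run of at least five 1s
runs <- rle(gaps)

# First qualifying run (this sketch assumes at least one exists)
hit <- which(runs$values == 1 & runs$lengths >= 5)[1]

# The run starts right after all the gaps that precede it
start <- sum(runs$lengths[seq_len(hit - 1)]) + 1

first_day <- dates[start]
print(first_day)
```

Here the gaps encode as lengths 2, 1, 5 with values 1, 8, 1, so the third run qualifies and starts at the fourth date.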
Answer 1: (score: 0)
Here is an approach using the tidyverse:
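library(tidyverse)   # data.table must also be installed for rleid()
# df is the data frame from the question, with time sorted within each PointID
df %>%
  group_by(PointID) %>%
  # gap in days to the previous row, and an id for each run of equal gaps
  mutate(new = c(NA, as.numeric(diff(time))), new1 = data.table::rleid(new)) %>%
  filter(new == 1) %>%                # rows that directly follow the previous day
  group_by(PointID, new1) %>%
  summarise(cnt = n(), time = first(time)) %>%
  filter(cnt >= 5) %>%                # five 1-day gaps = six consecutive days
  mutate(time = time - 1) %>%         # step back to the first day of the run
  group_by(PointID, time1 = format(time, format = '%Y')) %>%
  slice(1L) %>%                       # first qualifying run per PointID and year
  ungroup() %>%
  select(-c(new1, cnt, time1))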
# A tibble: 4 x 2
# PointID time
# <fctr> <date>
#1 X10Y11 1971-05-12
#2 X10Y11 1972-04-22
#3 X10Y12 1971-04-18
#4 X10Y12 1972-04-23
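Since the question mentions tables with more than 7 million rows, a data.table variant may be worth trying. The following is only a sketch and has not been benchmarked. It swaps rle for a different technique: within a sorted group, consecutive dates share the same value of (date minus row position), so that difference labels each run. The table `dt` and the helper columns `yr`, `run` and `n` are made up for illustration, and the data.table package is assumed to be installed.

```r
library(data.table)

# Tiny made-up table standing in for the real data
# (column names time and PointID as in the question)
dt <- data.table(
  time = as.Date("1971-04-01") + c(0:3, 10:16, 400:405, 5:10),
  PointID = rep(c("X10Y11", "X10Y12"), times = c(17, 6))
)

# Drop duplicate rows and sort so that consecutive days are adjacent
dt <- unique(dt)
setkey(dt, PointID, time)

# Consecutive dates share the same (date - row position) value,
# which therefore serves as a label for each run
dt[, yr := year(time)]
dt[, run := as.integer(time) - seq_len(.N), by = .(PointID, yr)]

# Start and length of every run; keep runs of six or more days,
# then take the earliest one per PointID and year
runs <- dt[, .(time = time[1], n = .N), by = .(PointID, yr, run)]
result <- runs[n >= 6, .(time = min(time)), by = .(PointID, yr)]
print(result[, .(time, PointID)])
```

As in the base R answer, runs are detected within each year, so a run that crosses New Year is split at the year boundary.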