I have a data frame of events that looks like this:
EVENT DATE LONG LAT TYPE
1 1/1/2000 23 45 A
2 2/1/2000 23 45 B
3 3/1/2000 23 45 B
3 5/2/2000 22 56 A
4 6/2/2000 19 21 A
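I would like to collapse it so that any events occurring on consecutive days at the same location (defined by LONG, LAT) are merged into a single event with a START and END date and a column of concatenated TYPEs.
So the table above would become: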
EVENT START-DATE END-DATE LONG LAT TYPE
1 1/1/2000 3/1/2000 23 45 ABB
2 5/2/2000 5/2/2000 22 56 A
3 6/2/2000 6/2/2000 19 21 A
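Any suggestions on how best to approach this would be greatly appreciated.
Answer 0 (score: 6)
This is a modified version of Ronak Shah's solution that treats non-consecutive events at the same location as separate event periods.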
# expanded data sample
df <- data.frame(
  DATE = as.Date(c("2000-01-01", "2000-01-02", "2000-01-03", "2000-01-05",
                   "2000-02-05", "2000-02-06", "2000-02-07"), format = "%Y-%m-%d"),
  LONG = c(23, 23, 23, 23, 22, 19, 22),
  LAT = c(45, 45, 45, 45, 56, 21, 56),
  TYPE = c("A", "B", "B", "A", "A", "B", "A")
)
library(dplyr)

df %>%
  group_by(LONG, LAT) %>%
  arrange(DATE) %>%
  # gap in days to the previous event at the same location (first row defaults to 1)
  mutate(DATE.diff = c(1, diff(DATE))) %>%
  # start a new period whenever the gap is not exactly one day
  mutate(PERIOD = cumsum(DATE.diff != 1)) %>%
  ungroup() %>%
  group_by(LONG, LAT, PERIOD) %>%
  summarise(START_DATE = min(DATE),
            END_DATE = max(DATE),
            TYPE = paste(TYPE, collapse = "")) %>%
  ungroup()

# A tibble: 5 x 6
   LONG   LAT PERIOD START_DATE END_DATE   TYPE
  <dbl> <dbl>  <int> <date>     <date>     <chr>
1    19    21      0 2000-02-06 2000-02-06 B
2    22    56      0 2000-02-05 2000-02-05 A
3    22    56      1 2000-02-07 2000-02-07 A
4    23    45      0 2000-01-01 2000-01-03 ABB
5    23    45      1 2000-01-05 2000-01-05 A
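One detail worth spelling out is why c(1, diff(DATE)) can be compared against plain numbers. The sketch below uses a small date vector made up for illustration (the name dates is not from the answer) and shows that diff() on Date values yields day differences, and that combining them with a leading 1 gives an ordinary numeric vector.

```r
# Illustrative dates only; assumed to be sorted already
dates <- as.Date(c("2000-01-01", "2000-01-02", "2000-01-03", "2000-01-05"))

# diff() on Date values returns a difftime measured in days
gaps <- diff(dates)
units(gaps)

# Putting a plain 1 in front yields an ordinary numeric vector,
# so comparisons such as `!= 1` behave like any numeric comparison
date_diff <- c(1, gaps)
is.numeric(date_diff)
date_diff != 1
```

Only the last element is flagged here, because 2000-01-05 is two days after the previous date.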
Edit: adding an explanation of the "PERIOD" variable
For simplicity, consider a mix of consecutive and non-consecutive events at a single location, so that we can skip the group_by(LONG, LAT) and arrange(DATE) steps:
# sample dataset of 10 events at the same location.
# first 3 are on consecutive days, next 2 are on consecutive days,
# next 4 are on consecutive days, & last 1 is on its own.
df2 <- data.frame(
  DATE = as.Date(c("2001-01-01", "2001-01-02", "2001-01-03",
                   "2001-01-05", "2001-01-06",
                   "2001-02-01", "2001-02-02", "2001-02-03", "2001-02-04",
                   "2001-04-01"), format = "%Y-%m-%d"),
  LONG = rep(23, 10),
  LAT = rep(45, 10),
  TYPE = LETTERS[1:10]
)
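As an intermediate step, we create some helper variables:
"DATE.diff" computes the difference between the current row's date and the previous row's date. Since the first row has no date before "2001-01-01", we default its difference to 1.
"non.consecutive" is TRUE when the computed date difference is not 1 (i.e. the row does not follow on from the previous day) and FALSE when it is 1 (i.e. it does). If you need to account for events on the same day at the same location in your dataset, you can change the calculation here from DATE.diff != 1 to DATE.diff > 1.
"PERIOD" tracks the number of TRUE results in the "non.consecutive" variable. Starting from the first row, "PERIOD" increases by 1 each time a row is not consecutive with the previous row.
As a result of these helper variables, "PERIOD" takes a different value for each run of consecutive dates.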
df2.intermediate <- df2 %>%
  mutate(DATE.diff = c(1, diff(DATE))) %>%
  mutate(non.consecutive = DATE.diff != 1) %>%
  mutate(PERIOD = cumsum(non.consecutive))
> df2.intermediate
DATE LONG LAT TYPE DATE.diff non.consecutive PERIOD
1 2001-01-01 23 45 A 1 FALSE 0
2 2001-01-02 23 45 B 1 FALSE 0
3 2001-01-03 23 45 C 1 FALSE 0
4 2001-01-05 23 45 D 2 TRUE 1
5 2001-01-06 23 45 E 1 FALSE 1
6 2001-02-01 23 45 F 26 TRUE 2
7 2001-02-02 23 45 G 1 FALSE 2
8 2001-02-03 23 45 H 1 FALSE 2
9 2001-02-04 23 45 I 1 FALSE 2
10 2001-04-01 23 45 J 56 TRUE 3
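The remark about same-day events can be sketched in base R. The data below is hypothetical (two events share 2001-01-02) and the names same_day, period_strict and period_loose are made up for this illustration. It compares the two conditions on the same vector of date differences.

```r
# Hypothetical data: two events fall on the same day (2001-01-02)
same_day <- data.frame(
  DATE = as.Date(c("2001-01-01", "2001-01-02", "2001-01-02", "2001-01-04")),
  TYPE = c("A", "B", "C", "D"),
  stringsAsFactors = FALSE
)

# Difference in days to the previous row; the first row defaults to 1
date_diff <- c(1, diff(same_day$DATE))

# `!= 1` treats a difference of 0 as a break, so the second same-day
# event starts a new period
period_strict <- cumsum(date_diff != 1)

# `> 1` only breaks on real gaps, so same-day events stay in one period
period_loose <- cumsum(date_diff > 1)

data.frame(same_day, date_diff, period_strict, period_loose)
```

With DATE.diff != 1 the event of TYPE "C" is split off into its own period; with DATE.diff > 1 it stays grouped with "A" and "B", and only "D" starts a new period.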
We can then treat "PERIOD" as a grouping variable to find the start/end dates and the events within each period:
df2.intermediate %>%
  group_by(PERIOD) %>%
  summarise(START_DATE = min(DATE),
            END_DATE = max(DATE),
            TYPE = paste(TYPE, collapse = "")) %>%
  ungroup()

# A tibble: 4 x 4
  PERIOD START_DATE END_DATE   TYPE
   <int> <date>     <date>     <chr>
1      0 2001-01-01 2001-01-03 ABC
2      1 2001-01-05 2001-01-06 DE
3      2 2001-02-01 2001-02-04 FGHI
4      3 2001-04-01 2001-04-01 J
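If the same collapsing is needed in several places, the steps could be wrapped in a small helper. This is only a sketch: the function name collapse_events and the events data are hypothetical, it assumes columns named DATE (class Date), LONG, LAT and TYPE, and it relies on the .groups argument of summarise(), which requires dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical helper: collapse runs of consecutive days per location.
# Assumes `events` has columns DATE (Date), LONG, LAT and TYPE.
collapse_events <- function(events) {
  events %>%
    arrange(LONG, LAT, DATE) %>%
    group_by(LONG, LAT) %>%
    # a gap other than one day starts a new period within the location
    mutate(PERIOD = cumsum(c(1, diff(DATE)) != 1)) %>%
    group_by(LONG, LAT, PERIOD) %>%
    summarise(START_DATE = min(DATE),
              END_DATE = max(DATE),
              TYPE = paste(TYPE, collapse = ""),
              .groups = "drop")
}

# Made-up input: three events at one location, one at another
events <- data.frame(
  DATE = as.Date(c("2000-01-01", "2000-01-02", "2000-01-04", "2000-01-02")),
  LONG = c(23, 23, 23, 19),
  LAT = c(45, 45, 45, 21),
  TYPE = c("A", "B", "C", "D"),
  stringsAsFactors = FALSE
)

result <- collapse_events(events)
result
```

For this input the helper returns three rows: one for the single event at (19, 21), and two for (23, 45), where the first two days are merged and the event on 2000-01-04 forms its own period.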
Answer 1 (score: 3)
Using dplyr, we can group by LAT and LONG, select the maximum and minimum DATE for each group, and paste the TYPE column together.
library(dplyr)
df %>%
  group_by(LONG, LAT) %>%
  summarise(start_date = min(as.Date(DATE, "%d/%m/%Y")),
            end_date = max(as.Date(DATE, "%d/%m/%Y")),
            type = paste0(TYPE, collapse = ""))
# LONG LAT start_date end_date type
# <int> <int> <date> <date> <chr>
#1 19 21 2000-02-06 2000-02-06 A
#2 22 56 2000-02-05 2000-02-05 A
#3 23 45 2000-01-01 2000-01-03 ABB
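The code in this answer expects df to hold the table from the question, with DATE stored as day/month/year text. The answer does not show how df was built, so the construction below is an assumption, written as a minimal base R sketch that copies the question's values (including the repeated EVENT number 3).

```r
# Assumed reconstruction of the question's table; DATE is kept as text
df <- data.frame(
  EVENT = c(1L, 2L, 3L, 3L, 4L),
  DATE = c("1/1/2000", "2/1/2000", "3/1/2000", "5/2/2000", "6/2/2000"),
  LONG = c(23L, 23L, 23L, 22L, 19L),
  LAT = c(45L, 45L, 45L, 56L, 21L),
  TYPE = c("A", "B", "B", "A", "A"),
  stringsAsFactors = FALSE
)

# The "%d/%m/%Y" format reads "5/2/2000" as 5 February 2000
parsed <- as.Date(df$DATE, "%d/%m/%Y")
format(parsed)
```

In this table each location has a single run of consecutive days, so grouping by location alone reproduces the table the question asks for.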