Combine rows with consecutive dates into a single row with start and end dates

Time: 2017-09-01 04:50:14

Tags: r

I have a data frame of events that looks like this:

EVENT     DATE       LONG    LAT    TYPE     
1         1/1/2000   23      45     A
2         2/1/2000   23      45     B
3         3/1/2000   23      45     B
3         5/2/2000   22      56     A
4         6/2/2000   19      21     A
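
To reproduce the table above, a minimal sketch of how it could be built in R follows. The object name `events` is made up for illustration, and reading DATE as day/month/year is an assumption (it matches the "%d/%m/%Y" format used in one of the answers below).

```r
# Assumption: DATE is stored as day/month/year text.
events <- data.frame(
  EVENT = c(1, 2, 3, 3, 4),
  DATE  = c("1/1/2000", "2/1/2000", "3/1/2000", "5/2/2000", "6/2/2000"),
  LONG  = c(23, 23, 23, 22, 19),
  LAT   = c(45, 45, 45, 56, 21),
  TYPE  = c("A", "B", "B", "A", "A"),
  stringsAsFactors = FALSE
)

# Convert the text to Date so that day-to-day differences can be computed.
events$DATE <- as.Date(events$DATE, format = "%d/%m/%Y")

str(events)
```

Once DATE is a proper Date column, consecutive days differ by exactly 1, which is what any grouping approach needs to test.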

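I would like to collapse it so that any events occurring on consecutive days at the same location (defined by LONG, LAT) are collapsed into a single event with START and END dates and a column of the concatenated TYPEs.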

So the table above would become:

EVENT     START-DATE    END-DATE    LONG    LAT    TYPE     
1         1/1/2000      3/1/2000    23      45     ABB
2         5/2/2000      5/2/2000    22      56     A
3         6/2/2000      6/2/2000    19      21     A

Any suggestions on how best to approach this would be much appreciated.

2 answers:

Answer 0: (score: 6)

This is a modified version of Ronak Shah's solution that treats non-consecutive events at the same location as separate event periods.

# expanded data sample
df <- data.frame(
  DATE = as.Date(c("2000-01-01", "2000-01-02", "2000-01-03", "2000-01-05",
                   "2000-02-05", "2000-02-06", "2000-02-07"), format = "%Y-%m-%d"),
  LONG = c(23, 23, 23, 23, 22, 19, 22),
  LAT = c(45, 45, 45, 45, 56, 21, 56),
  TYPE = c("A", "B", "B", "A", "A", "B", "A")
)

library(dplyr)

df %>%
  group_by(LONG, LAT) %>%
  arrange(DATE) %>%
  mutate(DATE.diff = c(1, diff(DATE))) %>%
  mutate(PERIOD = cumsum(DATE.diff != 1)) %>%
  ungroup() %>%
  group_by(LONG, LAT, PERIOD) %>%
  summarise(START_DATE = min(DATE),
            END_DATE = max(DATE),
            TYPE = paste(TYPE, collapse = "")) %>%
  ungroup()

# A tibble: 5 x 6
   LONG   LAT PERIOD START_DATE   END_DATE  TYPE
  <dbl> <dbl>  <int>     <date>     <date> <chr>
1    19    21      0 2000-02-06 2000-02-06     B
2    22    56      0 2000-02-05 2000-02-05     A
3    22    56      1 2000-02-07 2000-02-07     A
4    23    45      0 2000-01-01 2000-01-03   ABB
5    23    45      1 2000-01-05 2000-01-05     A
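
If the result should look more like the table requested in the question, something along these lines might work. This is only a sketch: the names `collapsed` and `result`, the sort order, and the way EVENT is renumbered are assumptions rather than part of the solution above, and `.groups = "drop"` needs dplyr 1.0.0 or later.

```r
library(dplyr)

# Same expanded sample as above.
df <- data.frame(
  DATE = as.Date(c("2000-01-01", "2000-01-02", "2000-01-03", "2000-01-05",
                   "2000-02-05", "2000-02-06", "2000-02-07")),
  LONG = c(23, 23, 23, 23, 22, 19, 22),
  LAT  = c(45, 45, 45, 45, 56, 21, 56),
  TYPE = c("A", "B", "B", "A", "A", "B", "A"),
  stringsAsFactors = FALSE
)

# Collapse runs of consecutive days per location.
collapsed <- df %>%
  group_by(LONG, LAT) %>%
  arrange(DATE) %>%
  mutate(PERIOD = cumsum(c(1, diff(DATE)) != 1)) %>%
  group_by(LONG, LAT, PERIOD) %>%
  summarise(START_DATE = min(DATE),
            END_DATE = max(DATE),
            TYPE = paste(TYPE, collapse = ""),
            .groups = "drop")

# Hypothetical post-processing: order by start date, renumber the events,
# and drop the PERIOD helper column.
result <- collapsed %>%
  arrange(START_DATE, LONG, LAT) %>%
  mutate(EVENT = row_number()) %>%
  select(EVENT, START_DATE, END_DATE, LONG, LAT, TYPE)

print(result)
```

Renumbering after the collapse keeps EVENT unique per period, at the cost of losing the original event identifiers.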

Edit: adding an explanation of the "PERIOD" variable.

For simplicity, let's consider a set of consecutive & non-consecutive events at the same location, so that we can skip the group_by(LONG, LAT) & arrange(DATE) steps:

# sample dataset of 10 events at the same location. 
# first 3 are on consecutive days, next 2 are on consecutive days,
# next 4 are on consecutive days, & last 1 is on its own.
df2 <- data.frame(
  DATE = as.Date(c("2001-01-01", "2001-01-02", "2001-01-03", 
                   "2001-01-05", "2001-01-06",
                   "2001-02-01", "2001-02-02", "2001-02-03", "2001-02-04",
                   "2001-04-01"), format = "%Y-%m-%d"),
  LONG = rep(23, 10),
  LAT = rep(45, 10),
  TYPE = LETTERS[1:10]
)

As an intermediate step, we create some helper variables:

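  1. "DATE.diff" calculates the difference between the current row's date & the previous row's date. Since the first row has no date before "2001-01-01", we default the difference to 1.

  2. "non.consecutive" indicates whether the calculated date difference is not 1 (i.e. not consecutive from the previous day), or 1 (i.e. consecutive from the previous day). If you need to account for same-day events at the same location in your dataset, you can change the calculation here from DATE.diff != 1 to DATE.diff > 1.

  3. "PERIOD" keeps track of the number of TRUE results in the "non.consecutive" variable. Starting from the first row, every time a row is not consecutive from the previous row, "PERIOD" increases by 1.

  4. As a result of these helper variables, "PERIOD" takes a different value for each group of consecutive dates.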

    df2.intermediate <- df2 %>%
      mutate(DATE.diff = c(1, diff(DATE))) %>%
      mutate(non.consecutive = DATE.diff != 1) %>%
      mutate(PERIOD = cumsum(non.consecutive))
    
    > df2.intermediate
             DATE LONG LAT TYPE DATE.diff non.consecutive PERIOD
    1  2001-01-01   23  45    A         1           FALSE      0
    2  2001-01-02   23  45    B         1           FALSE      0
    3  2001-01-03   23  45    C         1           FALSE      0
    4  2001-01-05   23  45    D         2            TRUE      1
    5  2001-01-06   23  45    E         1           FALSE      1
    6  2001-02-01   23  45    F        26            TRUE      2
    7  2001-02-02   23  45    G         1           FALSE      2
    8  2001-02-03   23  45    H         1           FALSE      2
    9  2001-02-04   23  45    I         1           FALSE      2
    10 2001-04-01   23  45    J        56            TRUE      3
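
Item 2 above mentions switching the test to DATE.diff > 1 when same-day events can occur. A small base R sketch of the difference, using a made-up date vector (the names `dates`, `date_diff`, `period_strict` and `period_relaxed` are illustrative only):

```r
# Hypothetical dates at one location; 2001-01-02 appears twice.
dates <- as.Date(c("2001-01-01", "2001-01-02", "2001-01-02",
                   "2001-01-03", "2001-01-05"))

# Step 1: gap in days to the previous row; the first row defaults to 1.
date_diff <- c(1, as.numeric(diff(dates)))

# Strict test: a same-day repeat (gap of 0) starts a new period.
period_strict <- cumsum(date_diff != 1)

# Relaxed test: only gaps longer than one day start a new period.
period_relaxed <- cumsum(date_diff > 1)

print(data.frame(dates, date_diff, period_strict, period_relaxed))
```

With the strict test the repeated day splits the run in two, while the relaxed test keeps the first four rows in one period.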
    

We can then treat "PERIOD" as a grouping variable, in order to find the start / end dates & the events within each period:

    df2.intermediate %>%
      group_by(PERIOD) %>%
      summarise(START_DATE = min(DATE),
                END_DATE = max(DATE),
                TYPE = paste(TYPE, collapse = "")) %>%
      ungroup()
    
    # A tibble: 4 x 4
      PERIOD START_DATE   END_DATE  TYPE
       <int>     <date>     <date> <chr>
    1      0 2001-01-01 2001-01-03   ABC
    2      1 2001-01-05 2001-01-06    DE
    3      2 2001-02-01 2001-02-04  FGHI
    4      3 2001-04-01 2001-04-01     J
    

Answer 1: (score: 3)

Using dplyr, we can group by LAT and LONG, select the maximum and minimum DATE for each group, and paste the TYPE column together.

library(dplyr)
df %>%
   group_by(LONG, LAT) %>%
   summarise(start_date = min(as.Date(DATE, "%d/%m/%Y")), 
             end_date = max(as.Date(DATE, "%d/%m/%Y")), 
             type = paste0(TYPE, collapse = ""))



#   LONG   LAT start_date   end_date  type
#  <int> <int>     <date>     <date> <chr>
#1    19    21 2000-02-06 2000-02-06     A
#2    22    56 2000-02-05 2000-02-05     A
#3    23    45 2000-01-01 2000-01-03   ABB
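
Note that grouping by location alone does not check whether the dates within a location are actually consecutive. A sketch of what that means, run on the expanded sample from the other answer (the name `by_location` is made up here, and `.groups = "drop"` assumes dplyr 1.0.0 or later):

```r
library(dplyr)

# Expanded sample: location (23, 45) has a gap between 2000-01-03 and 2000-01-05.
df <- data.frame(
  DATE = as.Date(c("2000-01-01", "2000-01-02", "2000-01-03", "2000-01-05",
                   "2000-02-05", "2000-02-06", "2000-02-07")),
  LONG = c(23, 23, 23, 23, 22, 19, 22),
  LAT  = c(45, 45, 45, 45, 56, 21, 56),
  TYPE = c("A", "B", "B", "A", "A", "B", "A"),
  stringsAsFactors = FALSE
)

# Group by location only, as in this answer.
by_location <- df %>%
  group_by(LONG, LAT) %>%
  summarise(start_date = min(DATE),
            end_date = max(DATE),
            type = paste0(TYPE, collapse = ""),
            .groups = "drop")

print(by_location)
```

Each location comes back as a single row whose range spans the gap, so this approach fits data where every location has one unbroken run of days.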