Merging overlapping date ranges with a hierarchy in R

Time: 2018-08-07 19:33:44

Tags: r

I am trying to merge overlapping date ranges according to priority. Here is an example. The code below

df = data.frame(Priority = c("Priority_2","Priority_1", "Priority_2"),
                Start = as.Date(c("2018-01-01", "2018-01-03", "2018-01-08")),
                End = as.Date(c("2018-01-04","2018-01-05","2018-01-09")))

gives you this table:

Priority      Start        End
Priority_2    2018-01-01   2018-01-04
Priority_1    2018-01-03   2018-01-05
Priority_2    2018-01-08   2018-01-09

I would like to merge the overlapping date ranges while respecting priority. The output table should look like this:

Priority      Start        End
Priority_2    2018-01-01   2018-01-02
Priority_1    2018-01-03   2018-01-05
NA            2018-01-06   2018-01-07
Priority_2    2018-01-08   2018-01-09

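For example, if Priority_1 overlaps with any other priority, the date range is assigned to Priority_1. Otherwise, if Priority_2 overlaps with any other priority, the date range is assigned to Priority_2. If a date range cannot be assigned to any priority, the result is NA. I would also be fine with output that looks like this: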

Priority      Start        End
Priority_2    2018-01-01   2018-01-03
Priority_1    2018-01-03   2018-01-05
NA            2018-01-05   2018-01-08
Priority_2    2018-01-08   2018-01-09

Does anyone know how to do this in R?

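1 answer:

Answer 0: (score: 1)

This turned out to be surprisingly fun (if you like puzzles). I would love to see what others come up with.

Let's do this using only the tidyverse, specifically dplyr, tidyr, and purrr.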

library(dplyr) # For dataframe functions
library(tidyr) # For nesting and fill functions
library(purrr) # For map functions

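In the tidy framework, the first thing I do is convert the data to a more machine-readable format: a long format in which every date gets its own row, with no more Start and End columns.

To get all the dates between Start and End, let's use map2 to nest the sequence in a new column. I nest the data because a range sometimes covers 1 date, sometimes 3 dates, and sometimes ???. Nesting lets me hold all the dates in a single variable of the data frame.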

# Identify all the dates in each range and nest them in a new column
df2 <- df %>%
  arrange(Start) %>%
  mutate(date = map2(Start,End,seq,by = 'day')) %>%
  select(-Start,-End)
    Priority                       date
1 Priority_2 17532, 17533, 17534, 17535
2 Priority_1        17534, 17535, 17536
3 Priority_2               17539, 17540
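
To make the list column less mysterious, here is a minimal sketch of what the map2 call produces on its own. The vector names starts, ends, and date_list are only illustrative and are not part of the original code.

```r
library(purrr)

# The same Start/End values as in the question, as plain vectors
starts <- as.Date(c("2018-01-01", "2018-01-03", "2018-01-08"))
ends   <- as.Date(c("2018-01-04", "2018-01-05", "2018-01-09"))

# map2() calls seq(start, end, by = "day") once per pair and returns a list
date_list <- map2(starts, ends, seq, by = "day")

# Each element is a Date vector whose length depends on the range
lengths(date_list)
format(date_list[[2]])
```

Because the elements have different lengths, they cannot sit in an ordinary atomic column, which is why the dates are stored as a list column before being unnested.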

Next, I unnest the date column so that every date is its own row.

# Unnest the dates column so each item is a row
df2 <- df2 %>%
  unnest(date)
    Priority       date
1 Priority_2 2018-01-01
2 Priority_2 2018-01-02
3 Priority_2 2018-01-03
4 Priority_2 2018-01-04
5 Priority_1 2018-01-03
6 Priority_1 2018-01-04
7 Priority_1 2018-01-05
8 Priority_2 2018-01-08
9 Priority_2 2018-01-09

Next, let's group by date and summarize to select the highest priority.

# Now we can group_by each date! This means we can summarize to only
# select the highest priority
df2 <- df2 %>%
  group_by(date) %>%
  # Min finds the lowest string. Priority_1 is lower than Priority_2 
  summarise(Priority = min(Priority)) 
# A tibble: 7 x 2
  date       Priority  
  <date>     <chr>     
1 2018-01-01 Priority_2
2 2018-01-02 Priority_2
3 2018-01-03 Priority_1
4 2018-01-04 Priority_1
5 2018-01-05 Priority_1
6 2018-01-08 Priority_2
7 2018-01-09 Priority_2
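
One caveat: min() compares the labels as strings, which is fine for Priority_1 and Priority_2 but would typically sort a label such as Priority_10 before Priority_2. The sketch below is an assumption about how one might guard against that, not part of the original answer: it pulls the number out of the label and picks the smallest numeric rank. The data frame daily and the column rank are hypothetical.

```r
library(dplyr)

# Hypothetical per-day data with a two-digit priority
daily <- tibble(
  date     = as.Date(c("2018-01-01", "2018-01-01", "2018-01-02")),
  Priority = c("Priority_2", "Priority_10", "Priority_10")
)

resolved <- daily %>%
  # Extract the number from the label so ranks compare numerically
  mutate(rank = as.integer(sub("^Priority_", "", Priority))) %>%
  group_by(date) %>%
  # Keep the label with the smallest numeric rank for each date
  summarise(Priority = Priority[which.min(rank)])

resolved
```

With only single-digit priorities, as in the question, the string-based min() and this numeric version pick the same label.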

Shoot! Some dates are missing. A simple full_join against every possible date in the overall range fixes that.

# Now for each date in the dataset we have only the highest priority
# but what about the missing values?
df2 <- df2 %>%
  # Join in a list of all days in the date range!
  # Name the join column explicitly to avoid the "Joining, by" message
  full_join(tibble(date = seq(min(df$Start),max(df$End),by='day')), by = 'date') %>%
  arrange(date)
# A tibble: 9 x 2
  date       Priority  
  <date>     <chr>     
1 2018-01-01 Priority_2
2 2018-01-02 Priority_2
3 2018-01-03 Priority_1
4 2018-01-04 Priority_1
5 2018-01-05 Priority_1
6 2018-01-06 NA        
7 2018-01-07 NA        
8 2018-01-08 Priority_2
9 2018-01-09 Priority_2
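
As an aside, tidyr's complete() can fill in the missing dates without building a separate table to join against. This is a swapped-in alternative to the full_join above, shown on a small hypothetical data frame named daily.

```r
library(dplyr)
library(tidyr)

# Hypothetical per-day data with a two-day gap
daily <- tibble(
  date     = as.Date(c("2018-01-05", "2018-01-08")),
  Priority = c("Priority_1", "Priority_2")
)

filled <- daily %>%
  # Expand to every day between the first and last date present;
  # rows that did not exist before get NA in the other columns
  complete(date = seq(min(date), max(date), by = "day"))

filled
```

Here the range comes from the dates already in the data rather than from df$Start and df$End, which gives the same span once every range has been expanded to daily rows.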

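Now we need to figure out how to group rows where the same priority occurs several times in a row. If we convert the NA values to the string "NA", we can use the lag function to check whether a value is the same as the previous one. lag always returns NA for the first value (there is nothing before it to look at), so that case needs a small fix. Then we can use fill to fill in all the gaps.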

# This is the data desired, but now it needs to be put back in the 
# human readable format it started in. Use lag to identify when Priority
# changes, then use fill to establish a group of the same Priority in a row.
df2 <- df2 %>%
  # Remove NA from Priority because it causes problems with !=
  replace_na(list(Priority = "NA")) %>%
  mutate(Group = ifelse(Priority != lag(Priority),1:n(),NA),
         # The first row will always be NA...so fix it.
         Group = ifelse(is.na(lag(Priority)),1,Group))%>%
  # Now that the breaks are identified, fill in the rest of the group 
  # with the most recent value
  fill(Group)
# A tibble: 9 x 3
  date       Priority   Group
  <date>     <chr>      <dbl>
1 2018-01-01 Priority_2  1.00
2 2018-01-02 Priority_2  1.00
3 2018-01-03 Priority_1  3.00
4 2018-01-04 Priority_1  3.00
5 2018-01-05 Priority_1  3.00
6 2018-01-06 NA          6.00
7 2018-01-07 NA          6.00
8 2018-01-08 Priority_2  8.00
9 2018-01-09 Priority_2  8.00
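
The same run detection can be sketched in base R with a change flag and cumsum, which avoids the lag and fill steps. This is an alternative formulation rather than the code used in this answer; the vector priority simply repeats the Priority column shown above, with "NA" as a string.

```r
# The Priority column from the table above, with NA already replaced by "NA"
priority <- c("Priority_2", "Priority_2", "Priority_1", "Priority_1",
              "Priority_1", "NA", "NA", "Priority_2", "Priority_2")

# TRUE wherever the value differs from the previous row;
# the first row always starts a new run
is_new_run <- c(TRUE, priority[-1] != priority[-length(priority)])

# A running count of run starts gives one id per consecutive block
group <- cumsum(is_new_run)
group
```

The ids come out as 1, 2, 3, 4 rather than 1, 3, 6, 8, but only their distinctness matters for the grouping step. Recent dplyr releases (1.1.0 and later) also provide consecutive_id() for the same purpose.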

From here, it is quick to group by the Priority and Group columns, summarize to get the Start and End dates, and clean up the variables.

# Return the data to human readable form using group_by and summarize
df2 <- df2 %>%
  group_by(Priority, Group) %>%
  summarise(Start = min(date),
            End = max(date)) %>%
  ungroup() %>%
  # Return "NA" values to NA
  mutate(Priority = ifelse(Priority == "NA", NA, Priority)) %>%
  arrange(Start) %>%
  select(Priority,Start,End)
# A tibble: 4 x 3
  Priority   Start      End       
  <chr>      <date>     <date>    
1 Priority_2 2018-01-01 2018-01-02
2 Priority_1 2018-01-03 2018-01-05
3 NA         2018-01-06 2018-01-07
4 Priority_2 2018-01-08 2018-01-09

Or... the whole thing in one go:

# Performing the whole thing in one go...
df3 <- df %>%
  arrange(Start) %>%
  mutate(date = map2(Start,End,seq,by = 'day')) %>%
  select(-Start,-End) %>%
  unnest(date) %>%
  group_by(date) %>%
  summarise(Priority = min(Priority))  %>%
  full_join(tibble(date = seq(min(df$Start),max(df$End),by='day')), by = 'date') %>%
  arrange(date) %>%
  replace_na(list(Priority = "NA")) %>%
  mutate(Group = ifelse(Priority != lag(Priority),1:n(),NA),
         # The first row will always be NA...so fix it.
         Group = ifelse(is.na(lag(Priority)),1,Group)) %>%
  fill(Group) %>%
  group_by(Priority, Group) %>%
  summarise(Start = min(date),
            End = max(date)) %>%
  ungroup() %>%
  mutate(Priority = ifelse(Priority == "NA", NA, Priority)) %>%
  arrange(Start) %>%
  select(Priority,Start,End)