I am trying to overlap date ranges according to priority. Here is an example. The following code
df = data.frame(Priority = c("Priority_2","Priority_1", "Priority_2"),
Start = as.Date(c("2018-01-01", "2018-01-03", "2018-01-08")),
End = as.Date(c("2018-01-04","2018-01-05","2018-01-09")))
gives you this table:
Priority Start End
Priority_2 2018-01-01 2018-01-04
Priority_1 2018-01-03 2018-01-05
Priority_2 2018-01-08 2018-01-09
I want to be able to overlap the date ranges, but with priority taken into account. The output table should look like this:
Priority Start End
Priority_2 2018-01-01 2018-01-02
Priority_1 2018-01-03 2018-01-05
NA 2018-01-06 2018-01-07
Priority_2 2018-01-08 2018-01-09
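For example, if Priority_1 overlaps with any other priority, the date range is assigned to Priority_1. Otherwise, if Priority_2 overlaps with any other priority, the date range is assigned to Priority_2. If no priority covers a date range, the result is NA. I would also be fine with output that looks like this: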
Priority Start End
Priority_2 2018-01-01 2018-01-03
Priority_1 2018-01-03 2018-01-05
NA 2018-01-05 2018-01-08
Priority_2 2018-01-08 2018-01-09
Does anyone know how to do this in R?
Answer 0 (score: 1)
This was surprisingly fun (if you like puzzles). I'd love to see what others come up with.
Let's do this using only the tidyverse, specifically dplyr, tidyr, and purrr.
library(dplyr) # For dataframe functions
library(tidyr) # For nesting and fill functions
library(purrr) # For map functions
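In the tidy framework, the first thing I want to do is convert this to a more machine-readable format: long data, where each date is represented by its own row. No more Start and End.
To get all the dates between Start and End, let's use map2 to nest the sequence in a new column. I nest this data because sometimes there is only 1 date, sometimes 3, sometimes ???. Nesting lets me keep all of the dates in a single variable of the data frame.
# Identify all the dates in the range and nest them in a new column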
df2 <- df %>%
arrange(Start) %>%
mutate(date = map2(Start,End,seq,by = 'day')) %>%
select(-Start,-End)
    Priority                       date
1 Priority_2 17532, 17533, 17534, 17535
2 Priority_1        17534, 17535, 17536
3 Priority_2               17539, 17540
Next, I want to unnest the date column so that each date is its own row.
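To make the nesting step concrete, here is a minimal sketch of what map2 does with seq on its own, outside the data frame. The vectors `starts` and `ends` are illustrative names, filled with the first two ranges from the question.

```r
library(purrr)

# First two ranges from the question's data (illustrative variable names)
starts <- as.Date(c("2018-01-01", "2018-01-03"))
ends   <- as.Date(c("2018-01-04", "2018-01-05"))

# map2() calls seq(start, end, by = "day") once per pair and returns a list,
# so each element can hold a different number of dates
days <- map2(starts, ends, seq, by = "day")

lengths(days)  # how many days each range covers
days[[2]]      # the dates covered by the second range
```

Because the result is a list, it can be stored as a list column even though the ranges have different lengths.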
# Unnest the dates column so each item is a row
df2 <- df2 %>%
unnest(date)
    Priority       date
1 Priority_2 2018-01-01
2 Priority_2 2018-01-02
3 Priority_2 2018-01-03
4 Priority_2 2018-01-04
5 Priority_1 2018-01-03
6 Priority_1 2018-01-04
7 Priority_1 2018-01-05
8 Priority_2 2018-01-08
9 Priority_2 2018-01-09
Next, let's group by date and summarise to select the highest priority.
# Now we can group_by each date! This means we can summarize to only
# select the highest priority
df2 <- df2 %>%
group_by(date) %>%
# Min finds the lowest string. Priority_1 is lower than Priority_2
summarise(Priority = min(Priority))
# A tibble: 7 x 2
  date       Priority
  <date>     <chr>
1 2018-01-01 Priority_2
2 2018-01-02 Priority_2
3 2018-01-03 Priority_1
4 2018-01-04 Priority_1
5 2018-01-05 Priority_1
6 2018-01-08 Priority_2
7 2018-01-09 Priority_2
Shoot! There are missing values. A simple full_join against all possible dates in the range takes care of that.
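One caveat about using min on the labels: the comparison is on strings, not numbers, so it only matches the intended ranking while the priorities stay in single digits. The sketch below shows the issue with a hypothetical Priority_10 label (not in the question's data) and one possible workaround that ranks on the numeric suffix.

```r
# "Priority_10" is hypothetical; the question only has Priority_1 and Priority_2
labels <- c("Priority_2", "Priority_10")

# String comparison goes character by character, so "1" < "2" decides it
by_string <- min(labels)

# Possible workaround: strip the prefix and rank on the number instead
rank <- as.integer(sub("^Priority_", "", labels))
by_number <- labels[which.min(rank)]

by_string
by_number
```

If the labels can go past 9, ranking on the extracted number (or on an ordered factor) is probably safer than min on the raw strings.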
# Now for each date in the dataset we have only the highest priority
# but what about the missing values?
df2 <- df2 %>%
# Join in a list of all days in the date range!
full_join(tibble(date = seq(min(df$Start),max(df$End),by='day')), by = 'date') %>%
arrange(date)
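# A tibble: 9 x 2
  date       Priority
  <date>     <chr>
1 2018-01-01 Priority_2
2 2018-01-02 Priority_2
3 2018-01-03 Priority_1
4 2018-01-04 Priority_1
5 2018-01-05 Priority_1
6 2018-01-06 NA
7 2018-01-07 NA
8 2018-01-08 Priority_2
9 2018-01-09 Priority_2
Now we need to figure out how to group runs where the same priority occurs several times in a row. If we convert the NA values to "NA", we can use the lag function to check whether a value is the same as the one before it. lag always returns NA for the first value (there is nothing before it to look at), so we need a small fix for that. Then we can use fill to fill in all the gaps.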
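As an alternative to joining against a hand-built calendar, tidyr's complete can add the missing days directly. This is a sketch on a small stand-in tibble (`daily` is an illustrative name, not the df2 from the answer), assuming the date column has no gaps you want to keep.

```r
library(dplyr)
library(tidyr)

# Small stand-in for the per-day table, with a two-day gap in the middle
daily <- tibble(
  date     = as.Date(c("2018-01-05", "2018-01-08")),
  Priority = c("Priority_1", "Priority_2")
)

# complete() expands date to every day between its min and max;
# rows that did not exist before get NA in the other columns
filled <- daily %>%
  complete(date = seq(min(date), max(date), by = "day"))

filled
```

The added rows carry NA in Priority, which is the same shape the full_join produces.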
# This is the data desired, but now it needs to be put back in the
# human readable format it started in. Use lag to identify when Priority
# changes, then use fill to establish a group of the same Priority in a row.
df2 <- df2 %>%
# Remove NA from Priority because it causes problems with !=
replace_na(list(Priority = "NA")) %>%
mutate(Group = ifelse(Priority != lag(Priority),1:n(),NA),
# The first row will always be NA...so fix it.
Group = ifelse(is.na(lag(Priority)),1,Group))%>%
# Now that the breaks are identified, fill in the rest of the group
# with the most recent value
fill(Group)
# A tibble: 9 x 3
  date       Priority   Group
  <date>     <chr>      <dbl>
1 2018-01-01 Priority_2  1.00
2 2018-01-02 Priority_2  1.00
3 2018-01-03 Priority_1  3.00
4 2018-01-04 Priority_1  3.00
5 2018-01-05 Priority_1  3.00
6 2018-01-06 NA          6.00
7 2018-01-07 NA          6.00
8 2018-01-08 Priority_2  8.00
9 2018-01-09 Priority_2  8.00
From here it is pretty quick to group by the Priority and Group columns, summarise to get the Start and End dates, and clean up the variables.
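The lag, ifelse, and fill combination can also be collapsed into a single running count of changes. This is a sketch of that variant on a stand-in tibble (`daily`, `changed`, and `runs` are illustrative names); it numbers the runs 1, 2, 3, ... rather than by row index, which makes no difference for grouping.

```r
library(dplyr)

# Stand-in for the per-day priorities, with NA already replaced by "NA"
daily <- tibble(
  Priority = c("Priority_2", "Priority_2", "Priority_1", "NA", "NA", "Priority_2")
)

runs <- daily %>%
  mutate(
    # TRUE wherever the value differs from the previous row; the default
    # makes the first row compare against itself, so it is never a change
    changed = Priority != lag(Priority, default = first(Priority)),
    # The running total of changes gives one id per consecutive run
    Group = cumsum(changed) + 1
  )

runs$Group
```

Supplying `default` to lag avoids the NA on the first row, so no separate fix-up or fill step is needed.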
# Return the data to human readable form using group_by and summarize
df2 <- df2 %>%
group_by(Priority, Group) %>%
summarise(Start = min(date),
End = max(date)) %>%
ungroup() %>%
# Return "NA" values to NA
mutate(Priority = ifelse(Priority == "NA", NA, Priority)) %>%
arrange(Start) %>%
select(Priority,Start,End)
# A tibble: 4 x 3
  Priority   Start      End
  <chr>      <date>     <date>
1 Priority_2 2018-01-01 2018-01-02
2 Priority_1 2018-01-03 2018-01-05
3 NA         2018-01-06 2018-01-07
4 Priority_2 2018-01-08 2018-01-09
Or... the whole thing in one fell swoop:
# Performing the whole thing in one go...
df3 <- df %>%
arrange(Start) %>%
mutate(date = map2(Start,End,seq,by = 'day')) %>%
select(-Start,-End) %>%
unnest(date) %>%
group_by(date) %>%
summarise(Priority = min(Priority)) %>%
full_join(tibble(date = seq(min(df$Start),max(df$End),by='day')), by = 'date') %>%
arrange(date) %>%
replace_na(list(Priority = "NA")) %>%
mutate(Group = ifelse(Priority != lag(Priority),1:n(),NA),
# The first row will always be NA...so fix it.
Group = ifelse(is.na(lag(Priority)),1,Group)) %>%
fill(Group) %>%
group_by(Priority, Group) %>%
summarise(Start = min(date),
End = max(date)) %>%
ungroup() %>%
mutate(Priority = ifelse(Priority == "NA", NA, Priority)) %>%
arrange(Start) %>%
select(Priority,Start,End)
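If this needs to run on more than one data frame, the same steps can be wrapped in a function. The sketch below is one way to do that, not the answer's own code: the name `resolve_priorities` is hypothetical, it assumes columns named Priority, Start, and End, it uses a running count of changes for the run ids, and the `.groups` argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical helper; assumes columns Priority, Start, End
resolve_priorities <- function(df) {
  # Every day between the earliest Start and the latest End
  all_days <- tibble(date = seq(min(df$Start), max(df$End), by = "day"))

  df %>%
    # as.character() guards against factor columns from older R versions
    mutate(Priority = as.character(Priority),
           date = map2(Start, End, seq, by = "day")) %>%
    select(Priority, date) %>%
    unnest(date) %>%
    # One row per day, keeping the alphabetically lowest label
    group_by(date) %>%
    summarise(Priority = min(Priority), .groups = "drop") %>%
    # Keep every calendar day; uncovered days get NA
    right_join(all_days, by = "date") %>%
    arrange(date) %>%
    # Number consecutive runs of the same label (NA treated as its own label)
    mutate(label = coalesce(Priority, "NA"),
           Group = cumsum(label != lag(label, default = first(label)))) %>%
    group_by(Group) %>%
    summarise(Priority = first(Priority),
              Start = min(date),
              End = max(date),
              .groups = "drop") %>%
    select(Priority, Start, End)
}

# The data from the question
df <- data.frame(Priority = c("Priority_2", "Priority_1", "Priority_2"),
                 Start = as.Date(c("2018-01-01", "2018-01-03", "2018-01-08")),
                 End = as.Date(c("2018-01-04", "2018-01-05", "2018-01-09")))

out <- resolve_priorities(df)
out
```

Grouping by the run id alone is enough here, since each run has a single priority by construction.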