Question

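I tried to find a suitable answer, but every case I found was much simpler than mine. I need to create a 4-level factor (nov, end_feb, end_apr, etc.) from the date information in my data frame and add it as a column. Because my actual df has more than 800,000 rows, the code also needs to run fast.

Here is what I have so far, using lubridate and %within%. It works, but it is very slow because I had to resort to sapply(df, sub_period_gen(date)) to create the new column. Ideally I need a solution that is vectorized, since I have several other factor generators that run on the same data frame and are also slow.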

sub_period_gen <- function(x){
  i_1 <- ymd("2019-11-01")%--% ymd("2019-11-30")
  i_2 <- ymd("2020-02-24")%--% ymd("2020-02-29")
  i_3 <- ymd("2020-04-24")%--% ymd("2020-04-30")
  if (x %within% i_1){
    return("nov")  # return case one
  } else if (x %within% i_2){
    return("end_feb")  # return case two
  } else if (x %within% i_3){
    return("end_apr")  # return case three
  } else{
    return("other")  # return case four
  }
}

Thanks!

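Edit: I have optimized my solution a little, but it still looks suboptimal and is hard to modify. I also moved the intervals into the global environment.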

sub_period_gen <- function(x){
  return(ifelse(x %within% i_1,"nov",ifelse(x %within% i_2,"end_feb",ifelse(x %within% i_3,"end_apr","other"))))
  }

My question differs from this one because my dates really are irregular, and the breaks are specific to a particular analysis.

Edit 2: Example input:

library(lubridate)
library(tibble)  # provides tibble()
toy <- tibble(date = ymd("2019-11-12","2020-03-11","2020-01-31","2019-12-19","2019-12-04","2020-01-21","2020-01-31","2020-02-16",
              "2020-02-28","2020-03-20","2020-02-08","2020-03-23","2020-01-22","2020-02-18","2020-03-19","2019-11-22",
              "2020-01-14","2020-03-04","2019-12-02","2019-11-03","2020-02-27","2020-02-13","2019-11-17","2020-03-17",
              "2020-04-14","2019-12-19","2019-11-05","2020-01-11","2020-04-25","2019-11-24"))

Desired output:

>  date         sub_period
>   <date>     <chr>     
> 1 2019-11-12 nov       
> 2 2020-03-11 other
> 3 2020-01-31 other   
> 4 2019-12-19 other   
> 5 2019-12-04 other   
> 6 2020-01-21 other   
> 7 2020-02-29 end_feb   
> 8 2020-02-16 other   
> 9 2020-04-28 end_apr

Answer 1

Here is one approach using case_when from dplyr:

library(dplyr)
library(lubridate)
toy %>%
  mutate(sub_period = 
         case_when(date >= ymd("2019-11-01") & date <= ymd("2019-11-30") ~ "nov",
                   date >= ymd("2020-02-24") & date <= ymd("2020-02-29") ~ "end_feb",
                   date >= ymd("2020-04-24") & date <= ymd("2020-04-30") ~ "end_apr",
                   TRUE ~ "other"))
# A tibble: 30 x 2
   date       sub_period
   <date>     <chr>     
 1 2019-11-12 nov       
 2 2020-03-11 other     
 3 2020-01-31 other     
 4 2019-12-19 other     
 5 2019-12-04 other     
 6 2020-01-21 other     
 7 2020-01-31 other     
 8 2020-02-16 other     
 9 2020-02-28 end_feb   
10 2020-03-20 other     
# … with 20 more rows

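If you need more speed, you can do a non-equi join using data.table's IDate class. First, you need to set up a separate table to join to; the join is then performed against this table: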

library(data.table)
setDT(toy)
toy[,date:=as.IDate(date)]

date.table <- data.table(Start = c(as.IDate("2019-11-01"),as.IDate("2020-02-24"),as.IDate("2020-04-24")),
                         End = c(as.IDate("2019-11-30"),as.IDate("2020-02-29"),as.IDate("2020-04-30")),
                         sub_period = c("nov","end_feb","end_apr"))

date.table
        Start        End sub_period
1: 2019-11-01 2019-11-30        nov
2: 2020-02-24 2020-02-29    end_feb
3: 2020-04-24 2020-04-30    end_apr
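
The join step itself is not shown above. Below is a minimal sketch of what a non-equi update join could look like. It rebuilds a smaller `toy` and the same `date.table` so that it runs on its own. The `i.sub_period` prefix and the final step that fills unmatched rows with "other" are assumptions of this sketch, not necessarily the code the answer originally used.

```r
library(data.table)

# Small stand-in for `toy` (a subset of the dates above), stored as IDate.
toy <- data.table(date = as.IDate(c("2019-11-12", "2020-03-11",
                                    "2020-02-28", "2020-04-25")))

# Lookup table of inclusive [Start, End] ranges, as defined above.
date.table <- data.table(
  Start      = as.IDate(c("2019-11-01", "2020-02-24", "2020-04-24")),
  End        = as.IDate(c("2019-11-30", "2020-02-29", "2020-04-30")),
  sub_period = c("nov", "end_feb", "end_apr")
)

# Non-equi update join: for each range, label the rows of `toy` whose date
# falls inside it. `i.sub_period` refers to the column from `date.table`.
toy[date.table, sub_period := i.sub_period,
    on = .(date >= Start, date <= End)]

# Rows that matched no range are still NA; give them the default label.
toy[is.na(sub_period), sub_period := "other"]
```

Because the update happens by reference, `toy` gains the `sub_period` column without being copied, which is where most of the speed benefit on a large table would be expected to come from.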

Answer 2

In base R, you can use nested ifelse calls like this (ymd still comes from lubridate):

sub_period_gen <- function(x){
  ifelse(x >= ymd("2019-11-01") & x <= ymd("2019-11-30"), "nov",
    ifelse(x >= ymd("2020-02-24") & x <= ymd("2020-02-29"), "end_feb",
      ifelse(x >= ymd("2020-04-24") & x <= ymd("2020-04-30"), "end_apr",
        "other")))
}

To get the desired output, you can bind the input and output together with cbind.data.frame(toy, sub_period = sub_period_gen(toy$date)).
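
Since the question mentions that the nested version is hard to modify, here is a hedged base R sketch that keeps the ranges in a small lookup data frame instead. The names `periods` and `label_periods` are hypothetical, introduced only for this illustration. It assumes the ranges do not overlap (a later range would overwrite an earlier one) and treats both endpoints as inclusive.

```r
# Lookup table of inclusive date ranges; edit rows here to change the breaks.
periods <- data.frame(
  start = as.Date(c("2019-11-01", "2020-02-24", "2020-04-24")),
  end   = as.Date(c("2019-11-30", "2020-02-29", "2020-04-30")),
  label = c("nov", "end_feb", "end_apr"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: label each date in `x` by the range that contains it.
# The loop runs once per range (three times here), and each iteration is a
# vectorized comparison over the whole date vector.
label_periods <- function(x, periods, default = "other") {
  out <- rep(default, length(x))
  for (i in seq_len(nrow(periods))) {
    hit <- !is.na(x) & x >= periods$start[i] & x <= periods$end[i]
    out[hit] <- periods$label[i]
  }
  factor(out, levels = c(periods$label, default))
}

dates  <- as.Date(c("2019-11-12", "2020-03-11", "2020-02-29", "2020-04-28"))
result <- label_periods(dates, periods)
```

With this layout, adding a fifth level means adding a row to `periods` rather than another nested ifelse.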

Creating a factor from dates using lubridate
