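I have a dataset with start and end dates, and I want to split each row of the data frame by the calendar years its period spans. Take this data frame as an example: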
df <- data.frame("starting_date"=as.Date("2015-06-01"),"end_date"=as.Date("2017-09-30"))
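It should be split into 3 rows: one with start date 2015-06-01 and end date 2015-12-31, one with start date 2016-01-01 and end date 2016-12-31, and one with start date 2017-01-01 and end date 2017-09-30. Any idea how to do this? The final result should look like this: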
  starting_date   end_date
1    2015-06-01 2015-12-31
2    2016-01-01 2016-12-31
3    2017-01-01 2017-09-30
EDIT: I have adjusted the code so that it works in base R.
EDIT2: I tried
library(dplyr)
library(lubridate)  # provides year()
df2 <- df[0, ]  # empty data frame with the same columns as df
for (i in 1:dim(df)[1]){
for (j in year(df$starting_date[i]):year(df$end_date[i]))
{
df2 <- bind_rows(df2,df[i,])
}
}
It works, but it is slow.
EDIT3: I managed to replicate each row as many times as the number of years involved:
df2 <- df[rep(seq_len(nrow(df)),year(df$end_date)-year(df$starting_date)+1),]
Now I need another column holding the years, like this:
  starting_date   end_date years
1    2015-06-01 2017-09-30  2015
2    2015-06-01 2017-09-30  2016
3    2015-06-01 2017-09-30  2017
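Once I get here, it is easy to reach the desired final result. Any ideas on how to do this? I tried building a separate vector of years to combine with df2, but it did not work: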
years <- lapply(df,function(x) seq(x[,"starting_date"],length.out=x[,"year"]))
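One possible way to get the years column is to build one sequence of years per row and flatten the result. The following is a minimal base R sketch, assuming end_date is never earlier than starting_date; start_year and end_year are helper names introduced here for illustration.

```r
df <- data.frame(starting_date = as.Date("2015-06-01"),
                 end_date = as.Date("2017-09-30"))

# Extract the calendar year of each date (base R, no lubridate needed)
start_year <- as.integer(format(df$starting_date, "%Y"))
end_year <- as.integer(format(df$end_date, "%Y"))

# Repeat each row once per calendar year its period touches
df2 <- df[rep(seq_len(nrow(df)), end_year - start_year + 1L), ]

# Map() yields one year sequence per original row; unlist() flattens them
# in the same order as the repeated rows
df2$years <- unlist(Map(seq, start_year, end_year))
rownames(df2) <- NULL
df2
```

Because Map() always returns a list, this should also work when df has several rows whose periods span different numbers of years.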
EDIT4: I finally managed to do it with the help of this post: R Create a time sequence as xts index based on two columns in data.frame. The code could probably be improved a lot, but it works.
library(lubridate)  # provides year()
# number of calendar years each period touches
diffs <- abs(with(df, year(starting_date)-year(end_date)))+1
df.rep <- df[rep(1:nrow(df), times=diffs), ]
reps <- rep(diffs, times=diffs)
# one sequence of years per row; Map() returns a list even when df has a single row
dates.l <- Map(seq, year(df$starting_date), year(df$end_date))
years <- do.call(c, dates.l)
df.long <- cbind(df.rep, reps, years)
# years already holds plain year numbers, so the dates are built from it directly
df.long$yearstart <- as.Date(paste0(df.long$years,"-01-01"))
df.long$yearend <- as.Date(paste0(df.long$years,"-12-31"))
df.long$starting_date2 <- pmax(df.long$starting_date,df.long$yearstart)
df.long$end_date2 <- pmin(df.long$end_date,df.long$yearend)
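The same clamping idea (repeat each row once per year, then clip the dates to that year with pmax and pmin) can be wrapped in a small function. This is only a sketch in base R, assuming every end date is on or after its start date; split_by_year, start_col and end_col are hypothetical names chosen for this example.

```r
split_by_year <- function(df, start_col = "starting_date", end_col = "end_date") {
  start_year <- as.integer(format(df[[start_col]], "%Y"))
  end_year <- as.integer(format(df[[end_col]], "%Y"))

  # Repeat every row once per calendar year it touches; other columns are kept
  out <- df[rep(seq_len(nrow(df)), end_year - start_year + 1L), , drop = FALSE]

  # The calendar year each repeated row stands for
  years <- unlist(Map(seq, start_year, end_year))

  # Clip each period to the bounds of its own calendar year
  out[[start_col]] <- pmax(out[[start_col]], as.Date(paste0(years, "-01-01")))
  out[[end_col]] <- pmin(out[[end_col]], as.Date(paste0(years, "-12-31")))
  rownames(out) <- NULL
  out
}

example <- data.frame(
  starting_date = as.Date(c("2015-06-01", "2013-06-01", "2016-02-11")),
  end_date = as.Date(c("2017-09-30", "2017-11-11", "2017-01-01")),
  col3 = c("AAA", "BBB", "CCC")
)
result <- split_by_year(example)
```

Since the rows are replicated before the dates are clipped, any additional columns travel along without a separate join.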
Answer 0 (score: 1)
Another approach could be:
library(dplyr)
library(lubridate)
#sample data
df <- data.frame("starting_date" = as.Date(c("2015-06-01", "2013-06-01", "2016-02-11")),
"end_date" = as.Date(c("2017-09-30", "2017-11-11", "2017-01-01")),
col3=c('AAA','BBB', 'CCC'),
col4=c('33445454','565664', '123'))
df1 <- df[,1:2] %>%
rowwise() %>%
do(rbind(data.frame(matrix(as.character(c(
.$starting_date,
seq(.$starting_date, .$end_date, by=1)[grep("\\d{4}-12-31|\\d{4}-01-01", seq(.$starting_date, .$end_date, by=1))],
.$end_date)), ncol=2, byrow=TRUE)))) %>%
data.frame() %>%
`colnames<-`(c("starting_date", "end_date")) %>%
mutate(starting_date= as.Date(starting_date, format= "%Y-%m-%d"),
end_date= as.Date(end_date, format= "%Y-%m-%d"))
#add temporary columns to the original and expanded date column dataframes
df$row_idx <- seq_len(nrow(df))
df$temp_col <- (year(df$end_date) - year(df$starting_date)) +1
df1 <- cbind(df1,row_idx = rep(df$row_idx,df$temp_col))
#join both dataframes to get the final result
final_df <- left_join(df1,df[,3:(ncol(df)-1)],by="row_idx") %>%
select(-row_idx)
final_df
The output is:
   starting_date   end_date col3     col4
1     2015-06-01 2015-12-31  AAA 33445454
2     2016-01-01 2016-12-31  AAA 33445454
3     2017-01-01 2017-09-30  AAA 33445454
4     2013-06-01 2013-12-31  BBB   565664
5     2014-01-01 2014-12-31  BBB   565664
6     2015-01-01 2015-12-31  BBB   565664
7     2016-01-01 2016-12-31  BBB   565664
8     2017-01-01 2017-11-11  BBB   565664
9     2016-02-11 2016-12-31  CCC      123
10    2017-01-01 2017-01-01  CCC      123
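The boundary-matching step in this answer picks every 31 December and 1 January inside the period, so a period that itself starts on 1 January or ends on 31 December could contribute an extra boundary. Below is a rough base R sketch of the same idea for a single period that guards against this; split_one is a hypothetical helper name, and the sketch assumes start is on or before end.

```r
split_one <- function(start, end) {
  days <- seq(start, end, by = "day")
  month_day <- format(days, "%m-%d")

  # Year ends strictly before the period end close a piece;
  # year starts strictly after the period start open a piece
  year_ends <- days[month_day == "12-31" & days < end]
  year_starts <- days[month_day == "01-01" & days > start]

  data.frame(starting_date = c(start, year_starts),
             end_date = c(year_ends, end))
}

pieces <- split_one(as.Date("2015-06-01"), as.Date("2017-09-30"))
whole_year <- split_one(as.Date("2016-01-01"), as.Date("2016-12-31"))
```

Each retained 31 December is followed by a retained 1 January, so the two vectors always have the same length.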