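Use case:
I am given a data frame df that contains a startDate and an endDate column. My goal is to "transform" df into a new data frame df_res that contains one row for each month occurring in the interval (startDate, endDate), with the additional columns year, month and numberOfDaysInMonth, all of type int.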
Example:
Input: df
id  startDate   endDate     someOtherCol
1   2017-09-23  2018-02-01  val1
2   2018-01-01  2018-03-31  val2
... ...         ...         ...
Desired output: df_res
id  year  month  numberOfDaysInMonth  someOtherCol
1   2017  9      8                    val1
1   2017  10     31                   val1
1   2017  11     30                   val1
1   2017  12     31                   val1
1   2018  1      31                   val1
1   2018  2      1                    val1
2   2018  1      31                   val2
2   2018  2      28                   val2
2   2018  3      31                   val2
... ...   ...    ...                  ...
Background:
I am rather new to R, but I know of nice packages such as dplyr. Even with these packages, I was unable to achieve the above in a neat way. The closest I got was "Expand rows by date range using start and end date", but that does not yield the number of days contained in each month of the range.
Any help is much appreciated.
Answer 0 (score: 2)
If you don't mind a data.table solution, you can create a sequence of consecutive dates between startDate and endDate, and then aggregate by id, someOtherCol, year and month, as follows:
dat[, .(Dates=seq(startDate, endDate, by="1 day")), by=.(id, someOtherCol)][,
    .N, by=.(id, someOtherCol, year(Dates), month(Dates))]
Output:
   id someOtherCol year month  N
1:  1         val1 2017     9  8
2:  1         val1 2017    10 31
3:  1         val1 2017    11 30
4:  1         val1 2017    12 31
5:  1         val1 2018     1 31
6:  1         val1 2018     2  1
7:  2         val2 2018     1 31
8:  2         val2 2018     2 28
9:  2         val2 2018     3 31
Data:
library(data.table)
dat <- fread("id startDate endDate someOtherCol
1 2017-09-23 2018-02-01 val1
2 2018-01-01 2018-03-31 val2")
datecols <- c("startDate", "endDate")
dat[, (datecols) := lapply(.SD, as.Date, format="%Y-%m-%d"), .SDcols=datecols]
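For reference, here is a minimal end-to-end sketch that combines the data and the query above and shapes the result like the df_res asked for in the question. The name df_res, the renaming of N to numberOfDaysInMonth, and the column order come from the question, not from this answer. This is one possible way to do it, not the only one.

```r
library(data.table)

# Example data from the question
dat <- fread("id startDate endDate someOtherCol
1 2017-09-23 2018-02-01 val1
2 2018-01-01 2018-03-31 val2")
datecols <- c("startDate", "endDate")
dat[, (datecols) := lapply(.SD, as.Date, format = "%Y-%m-%d"), .SDcols = datecols]

# Step 1: expand each row into one row per day.
# Step 2: count the days per (id, year, month); .N is the group size.
df_res <- dat[, .(Dates = seq(startDate, endDate, by = "1 day")),
              by = .(id, someOtherCol)][
  , .(numberOfDaysInMonth = .N),
  by = .(id, year = year(Dates), month = month(Dates), someOtherCol)]

# Reorder the columns to match the layout shown in the question
setcolorder(df_res, c("id", "year", "month", "numberOfDaysInMonth", "someOtherCol"))
print(df_res)
```

Expanding to daily rows keeps the logic simple, but it creates one row per day of every interval, so memory use grows with the length of the date ranges.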
Answer 1 (score: 1)
A tidyverse solution:
# example data
df <- read.table(text = "
id startDate endDate someOtherCol
1 2017-09-23 2018-02-01 val1
2 2018-01-01 2018-03-31 val2
", header = TRUE, stringsAsFactors = FALSE)

library(tidyverse)
library(lubridate)

df %>%
  mutate_at(vars(startDate, endDate), ymd) %>%                      # convert to Date columns (if needed)
  group_by(id) %>%                                                  # for each id (assumes one row per id)
  mutate(d = list(seq(startDate, endDate, by = "1 day"))) %>%       # create a sequence of dates (as a list column)
  unnest(d) %>%                                                     # one row per date; a bare unnest() warns in tidyr >= 1.0
  group_by(id, year = year(d), month = month(d), someOtherCol) %>%  # group by id, year and month of each date, and someOtherCol
  summarise(numberOfDaysInMonth = n()) %>%                          # count days
  ungroup()                                                         # drop the grouping
# # A tibble: 9 x 5
#      id  year month someOtherCol numberOfDaysInMonth
#   <int> <dbl> <dbl> <chr>                      <int>
# 1     1  2017     9 val1                           8
# 2     1  2017    10 val1                          31
# 3     1  2017    11 val1                          30
# 4     1  2017    12 val1                          31
# 5     1  2018     1 val1                          31
# 6     1  2018     2 val1                           1
# 7     2  2018     1 val2                          31
# 8     2  2018     2 val2                          28
# 9     2  2018     3 val2                          31
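Note that the question asks for year, month and numberOfDaysInMonth to be of type int, while the tibble above shows year and month as <dbl>. A minimal sketch of one way to get integer columns is to wrap the lubridate calls in as.integer() when grouping. The name df_res is taken from the question, and the rest follows the pipeline above (with the date conversion written out explicitly).

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Example data from the question
df <- read.table(text = "
id startDate endDate someOtherCol
1 2017-09-23 2018-02-01 val1
2 2018-01-01 2018-03-31 val2
", header = TRUE, stringsAsFactors = FALSE)

df_res <- df %>%
  mutate(startDate = ymd(startDate), endDate = ymd(endDate)) %>%  # parse to Date
  group_by(id) %>%                                                # assumes one row per id
  mutate(d = list(seq(startDate, endDate, by = "1 day"))) %>%     # daily sequence as a list column
  unnest(d) %>%                                                   # one row per day
  group_by(id,
           year = as.integer(year(d)),                            # cast to integer
           month = as.integer(month(d)),                          # cast to integer
           someOtherCol) %>%
  summarise(numberOfDaysInMonth = n()) %>%                        # n() already returns an integer
  ungroup()

str(df_res)
```

If id is not unique per row, group_by(id) would pass vectors of length greater than one to seq(), which fails; in that case rowwise() in place of group_by(id) is one alternative.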