I have a simple question about data manipulation. Given the following dataset:
n = c("john","jane","tim","john","jimmy","tim","jane","john","jimmy")
s = c("2012-03-21","2013-02-12","2014-01-01","2012-05-21","2010-12-17","2012-01-21","2013-03-12","2013-08-21","2010-09-17")
df = data.frame(n,s)
n s
1 john 2012-03-21
2 jane 2013-02-12
3 tim 2014-01-01
4 john 2012-05-21
5 jimmy 2010-12-17
6 tim 2012-01-21
7 jane 2013-03-12
8 john 2013-08-21
9 jimmy 2010-09-17
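I want to create a third column that gives, for each person, the number of months elapsed since that person's earliest date. It would look like this: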
n s output
1 john 2012-03-21 0
2 jane 2013-02-12 0
3 tim 2014-01-01 24
4 john 2012-05-21 2
5 jimmy 2010-12-17 3
6 tim 2012-01-21 0
7 jane 2013-03-12 1
8 john 2013-08-21 17
9 jimmy 2010-09-17 0
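As you can see, taking john as an example, the earliest date is 2012-03-21, so it counts the months from 2012-03-21 to 2012-05-21, and then from 2012-03-21 to 2013-08-21, and puts each result in the corresponding row.
I think dplyr or the apply functions would come in handy, but I find myself writing quite a lot of code for something that shouldn't be this hard.
Thanks for your help.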
Answer 0 (score: 2)
In my answer I use the lubridate package to make sure the s column in df is not treated as a string or a factor:
library(dplyr)
library(lubridate)
df$s = as_date(df$s)
Create a separate data frame holding each person's start date:
df.startdate = df %>% group_by(n) %>% summarise(start_date = min(s))
Now merge the main df with the newly built df.startdate and count the whole months between the two dates:
answer = merge(df, df.startdate, by = "n") %>%
  mutate(output = interval(start_date, s) %/% months(1))
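As a possible variation on the steps above, the separate start-date data frame and the merge can be folded into one grouped mutate. This is only a sketch that assumes dplyr and lubridate are installed; it rebuilds the question's data so it runs on its own. Unlike merge, which sorts by the key by default, it keeps the original row order.

```r
library(dplyr)
library(lubridate)

# Data from the question, with s parsed as dates up front
n <- c("john","jane","tim","john","jimmy","tim","jane","john","jimmy")
s <- c("2012-03-21","2013-02-12","2014-01-01","2012-05-21","2010-12-17",
       "2012-01-21","2013-03-12","2013-08-21","2010-09-17")
df <- data.frame(n, s = as_date(s))

answer <- df %>%
  group_by(n) %>%
  # whole months between each person's earliest date and the row's date
  mutate(output = interval(min(s), s) %/% months(1)) %>%
  ungroup()
```

Because this counts completed months, a row whose day of month is earlier than the start date's day of month (tim's 2014-01-01 against a start of 2012-01-21) should come out one lower than in the question's expected table.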
Answer 1 (score: 2)
We can use dplyr:
n = c("john","jane","tim","john","jimmy","tim","jane","john","jimmy")
s = c("2012-03-21","2013-02-12","2014-01-01","2012-05-21","2010-12-17","2012-01-21","2013-03-12","2013-08-21","2010-09-17")
s = as.Date(s)
df = data.frame(n,s)
library(dplyr)
df %>%
  group_by(n) %>%
  mutate(out = round(as.integer(difftime(s, s[which.min(s)], units = 'days')) / 30, 0))
#> # A tibble: 9 x 3
#> # Groups: n [4]
#> n s out
#> <fctr> <date> <dbl>
#> 1 john 2012-03-21 0
#> 2 jane 2013-02-12 0
#> 3 tim 2014-01-01 24
#> 4 john 2012-05-21 2
#> 5 jimmy 2010-12-17 3
#> 6 tim 2012-01-21 0
#> 7 jane 2013-03-12 1
#> 8 john 2013-08-21 17
#> 9 jimmy 2010-09-17 0
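As always, counting months is tricky because months have different lengths; dividing the difference in days by 30 and rounding is only an approximation.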
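One way to sidestep the varying month lengths is to ignore the day of month and count calendar-month boundaries instead. The sketch below uses only base R and assumes that this is the definition of "months" the asker wants; the helper name month_index is made up for illustration. On the question's data it reproduces the expected output column.

```r
n <- c("john","jane","tim","john","jimmy","tim","jane","john","jimmy")
s <- as.Date(c("2012-03-21","2013-02-12","2014-01-01","2012-05-21","2010-12-17",
               "2012-01-21","2013-03-12","2013-08-21","2010-09-17"))
df <- data.frame(n, s)

# Hypothetical helper: map a date to a running month number (year * 12 + month)
month_index <- function(d) {
  lt <- as.POSIXlt(d)
  (lt$year + 1900) * 12 + lt$mon   # lt$year counts from 1900, lt$mon is 0-11
}

# ave() applies FUN within each group of n and returns values in the original row order
df$output <- ave(month_index(df$s), df$n, FUN = function(m) m - min(m))
df$output
```

Under this definition 2012-01-31 to 2012-02-01 counts as one month, so it suits "which calendar month" questions better than "how much time has elapsed" questions.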