My data looks like this:
df <- data.frame(Id=c(1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,9,9,9,9),Date=c("2013-04","2013-12","2013-01","2013-12","2013-11",
"2013-12","2012-04","2013-12","2012-08","2014-12","2013-08","2014-12","2013-08","2014-12","2011-01","2013-11","2013-12","2014-01","2014-04"))
To get the dates into a proper format:
df$Date <- paste0(df$Date,"-01")
I only need the years, such that each Id ends up with 2 dates.
If I do this with the existing data:
require(lubridate)
df$Date <- year(as.Date(df$Date)-days(1))
then for a given Id I sometimes get the same year twice.
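The duplicates arise because subtracting one day only moves a date into the previous year when it falls on 1 January. A minimal base R sketch of that effect, using three illustrative dates and `format()` in place of lubridate's `year()`:

```r
# Subtracting one day changes the year only for dates on 1 January.
dates <- as.Date(c("2013-01-01", "2013-04-01", "2013-12-01"))
shifted_years <- as.integer(format(dates - 1, "%Y"))
shifted_years
# 2012 2013 2013
```

So an Id with dates 2013-04 and 2013-12 ends up with 2013 twice.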
The desired output for the Date column is:
2012 2013 2012 2013 2012 2013 2012 2013 2013 2014 2013 2014 2013 2014 2011 2013 2014
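Note that the last date for a given Id is always correct, so the preceding year must be corrected based on the last date. The dates must be in a format that can be converted to a year, as shown.
Edit: here is such a case: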
Id Date
1 2013-11-01
1 2013-12-01
1 2014-01-01
1 2014-04-01
Currently I get: 2012, 2013, 2013, 2013
I need: 2012, 2013, 2013, 2014
Answer 0 (score: 4)
This is how I would solve it with the data.table package (though it seems overly complicated to me)
library(data.table)
setDT(df)[, year := year(Date)][,
year := if(.N == 2) (year[2] - 1):year[2] else year,
Id][]
# Id Date year indx
# 1: 1 2013-04-01 2012 2
# 2: 1 2013-12-01 2013 2
# 3: 2 2013-01-01 2012 2
# 4: 2 2013-12-01 2013 2
# 5: 3 2013-11-01 2012 2
# 6: 3 2013-12-01 2013 2
# 7: 4 2012-04-01 2012 2
# 8: 4 2013-12-01 2013 2
# 9: 5 2012-08-01 2013 2
# 10: 5 2014-12-01 2014 2
# 11: 6 2013-08-01 2013 2
# 12: 6 2014-12-01 2014 2
# 13: 7 2013-08-01 2013 2
# 14: 7 2014-12-01 2014 2
# 15: 8 2011-01-01 2011 1
Or in a single step (thanks to @Arun for this):
setDT(df)[, year := {tmp = year(Date);
if (.N == 2L) (tmp[2]-1L):tmp[2] else tmp},
Id]
Edit: Based on the OP's new data, we can modify the code by adding an extra index
setDT(df)[, indx := if(.N > 2) rep(seq_len(.N/2), each = 2) + 1L else .N, Id]
df[, year := {tmp = year(Date); if (.N > 1L) (tmp[2] - 1L):tmp[2] else tmp},
list(Id, indx)][]
# Id Date indx year
# 1: 1 2013-04-01 2 2012
# 2: 1 2013-12-01 2 2013
# 3: 2 2013-01-01 2 2012
# 4: 2 2013-12-01 2 2013
# 5: 3 2013-11-01 2 2012
# 6: 3 2013-12-01 2 2013
# 7: 4 2012-04-01 2 2012
# 8: 4 2013-12-01 2 2013
# 9: 5 2012-08-01 2 2013
# 10: 5 2014-12-01 2 2014
# 11: 6 2013-08-01 2 2013
# 12: 6 2014-12-01 2 2014
# 13: 7 2013-08-01 2 2013
# 14: 7 2014-12-01 2 2014
# 15: 8 2011-01-01 1 2011
# 16: 9 2013-11-01 2 2012
# 17: 9 2013-12-01 2 2013
# 18: 9 2014-01-01 3 2013
# 19: 9 2014-04-01 3 2014
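To see what the extra index does, here is a minimal base R sketch of the same pairing logic. The helper name `pair_index` is hypothetical and not part of the answer; it only mirrors the expression used for `indx` above.

```r
# Hypothetical helper mirroring the indx expression above:
# groups of more than 2 rows are split into consecutive pairs (2, 2, 3, 3, ...),
# smaller groups simply get their own size as the index.
pair_index <- function(n) {
  if (n > 2) rep(seq_len(n / 2), each = 2) + 1L else n
}

pair_index(1)  # 1
pair_index(2)  # 2
pair_index(4)  # 2 2 3 3
```

For an odd group size above 2 the result is shorter than the group, so this approach presumably relies on such groups having an even number of rows.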
Or another possible solution, provided by @akrun
setDT(df)[, `:=`(year = year(Date), indx = .N, indx2 = as.numeric(gl(.N,2, .N))), Id]
df[indx > 1, year:=(year[2]-1):year[2], list(Id, indx2)][]
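The `gl(.N, 2, .N)` call is what assigns each consecutive pair of rows its own sub-group. A small base R sketch of the values it produces for different group sizes (the wrapper name `pair_id` is hypothetical, used only for illustration):

```r
# gl(n, k, length) builds a factor with n levels, each repeated k times,
# cut to the requested length; as.numeric() returns the level codes.
pair_id <- function(n) as.numeric(gl(n, 2, n))

pair_id(1)  # 1
pair_id(2)  # 1 1
pair_id(4)  # 1 1 2 2
```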
Answer 1 (score: 3)
Using dplyr, with an approach similar to @David Arenburg's:
library(dplyr)
df %>%
group_by(Id) %>%
mutate(year=as.numeric(sub('-.*', '', Date)),
year=replace(year, n()>1, c(year[2]-1, year[2])))
# Id Date year
#1 1 2013-04 2012
#2 1 2013-12 2013
#3 2 2013-01 2012
#4 2 2013-12 2013
#5 3 2013-11 2012
#6 3 2013-12 2013
#7 4 2012-04 2012
#8 4 2013-12 2013
#9 5 2012-08 2013
#10 5 2014-12 2014
#11 6 2013-08 2013
#12 6 2014-12 2014
#13 7 2013-08 2013
#14 7 2014-12 2014
#15 8 2011-01 2011
Or, using base R, you could try
with(df, ave(as.numeric(sub('-.*', '', Date)), Id,
FUN=function(x) if(length(x)>1)(x[2]-1):x[2] else x))
#[1] 2012 2013 2012 2013 2012 2013 2012 2013 2013 2014 2013 2014 2013 2014 2011
Or, with an additional pair index for the updated data:
df$indx <- with(df, ave(Id, Id, FUN=function(x) (seq_along(x)-1)%/%2+1))
with(df, ave(as.numeric(sub('-.*', '', Date)), Id, indx,
FUN=function(x) if(length(x)>1)(x[2]-1):x[2] else x))
#[1] 2012 2013 2012 2013 2012 2013 2012 2013 2013 2014 2013 2014 2013 2014 2011
#[16] 2012 2013 2013 2014
Answer 2 (score: 2)
Here is a dplyr solution. You can drop the intermediate fields last_year and year2, but I have left them in here for clarity:
library(stringr)
library(dplyr)
df %>%
group_by(Id) %>%
mutate(
last_year = last(as.integer(str_sub(Date, 1, 4))),
year2 = row_number() - n(),
year = last_year + year2
)
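To make the arithmetic concrete, here is a base R sketch of the same computation for a single Id, using the four-row case from the question's edit as assumed input. The variable names are illustrative only.

```r
# Same idea as last_year + (row_number() - n()), for one Id in base R.
dates <- c("2013-11-01", "2013-12-01", "2014-01-01", "2014-04-01")
last_year <- as.integer(substr(dates[length(dates)], 1, 4))
offset <- seq_along(dates) - length(dates)   # -3 -2 -1 0
years <- last_year + offset
years
# 2011 2012 2013 2014
```

Because this counts back one year per row, it matches the desired output for Ids with one or two rows, but for the four-row case it gives 2011, 2012, 2013, 2014 rather than the requested 2012, 2013, 2013, 2014.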