Correcting the previous year by id in R

Time: 2014-12-31 14:04:36

Tags: r

My data looks like this:

df <- data.frame(Id=c(1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,9,9,9,9),Date=c("2013-04","2013-12","2013-01","2013-12","2013-11",
             "2013-12","2012-04","2013-12","2012-08","2014-12","2013-08","2014-12","2013-08","2014-12","2011-01","2013-11","2013-12","2014-01","2014-04"))

To get the dates into a proper format:

df$Date <- paste0(df$Date,"-01")

I only need the years, such that each id contains 2 dates.

If I do this on the existing data:

require(lubridate)
df$Date <- year(as.Date(df$Date)-days(1))

I sometimes get the same year twice for a given id.
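To see why the repeated year appears, here is a minimal base R sketch. It assumes that subtracting 1 from a `Date` behaves like the `days(1)` subtraction above, and it uses `format()` as a stand-in for `year()`. Both dates of id 1 fall in 2013, and shifting them back by one day moves neither of them into 2012:

```r
# Dates of id 1 after appending "-01"
d <- as.Date(c("2013-04-01", "2013-12-01"))

# Base R stand-in for year(as.Date(x) - days(1)):
# shift back one day, then keep only the year
shifted <- d - 1
yrs <- as.integer(format(shifted, "%Y"))

print(shifted)  # the last days of March and November 2013
print(yrs)      # the same year twice, so the id ends up with a repeated value
```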

The desired output for Date is:

 2012 2013 2012 2013 2012 2013 2012 2013 2013 2014 2013 2014 2013 2014 2011 2013 2014

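Note that the last date for a given id is always correct, so the previous year has to be corrected based on the last date. The dates must be in a format that can be converted to a year, as shown.

Edit: here is one such case: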

Id Date 
1 2013-11-01    
1 2013-12-01     
1 2014-01-01    
1 2014-04-01

Right now I get: 2012, 2013, 2013, 2013

What I need: 2012, 2013, 2013, 2014

3 Answers:

Answer 0: (score: 4)

Here is how I would solve this using the data.table package (though it seems too complicated to me)

library(data.table)
setDT(df)[, year := year(Date)][, 
            year := if(.N == 2) (year[2] - 1):year[2] else year,
            Id][]    

#     Id       Date year indx
#  1:  1 2013-04-01 2012    2
#  2:  1 2013-12-01 2013    2
#  3:  2 2013-01-01 2012    2
#  4:  2 2013-12-01 2013    2
#  5:  3 2013-11-01 2012    2
#  6:  3 2013-12-01 2013    2
#  7:  4 2012-04-01 2012    2
#  8:  4 2013-12-01 2013    2
#  9:  5 2012-08-01 2013    2
# 10:  5 2014-12-01 2014    2
# 11:  6 2013-08-01 2013    2
# 12:  6 2014-12-01 2014    2
# 13:  7 2013-08-01 2013    2
# 14:  7 2014-12-01 2014    2
# 15:  8 2011-01-01 2011    1

Or in one step (thanks to @Arun for this):

setDT(df)[, year := {tmp = year(Date); 
            if (.N == 2L) (tmp[2]-1L):tmp[2] else tmp},
            Id]
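The expression `(tmp[2] - 1L):tmp[2]` is what rewrites a two-row group: it ignores the first row's year and builds the pair (last year minus one, last year). Below is a small base R sketch of that step in isolation, where `tmp` is typed in by hand to stand for the years of id 5:

```r
# Years extracted from the two dates of id 5 ("2012-08-01", "2014-12-01")
tmp <- c(2012L, 2014L)

# The second (last) year is trusted; the first one is rebuilt from it
corrected <- (tmp[2] - 1L):tmp[2]
print(corrected)

# The parentheses matter: `:` binds tighter than binary `-`,
# so this variant subtracts a whole sequence from tmp[2] instead
unparenthesised <- tmp[2] - 1L:tmp[2]
print(length(unparenthesised))
```

This is why the first year of id 5 becomes 2013 even though its original date was in 2012.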

Edit: Based on the OP's new data, we can modify the code by adding an additional index

setDT(df)[, indx := if(.N > 2) rep(seq_len(.N/2), each = 2) + 1L else .N, Id] 
df[, year := {tmp = year(Date); if (.N > 1L) (tmp[2] - 1L):tmp[2] else tmp}, 
     list(Id, indx)][]
#     Id       Date indx year
#  1:  1 2013-04-01    2 2012
#  2:  1 2013-12-01    2 2013
#  3:  2 2013-01-01    2 2012
#  4:  2 2013-12-01    2 2013
#  5:  3 2013-11-01    2 2012
#  6:  3 2013-12-01    2 2013
#  7:  4 2012-04-01    2 2012
#  8:  4 2013-12-01    2 2013
#  9:  5 2012-08-01    2 2013
# 10:  5 2014-12-01    2 2014
# 11:  6 2013-08-01    2 2013
# 12:  6 2014-12-01    2 2014
# 13:  7 2013-08-01    2 2013
# 14:  7 2014-12-01    2 2014
# 15:  8 2011-01-01    1 2011
# 16:  9 2013-11-01    2 2012
# 17:  9 2013-12-01    2 2013
# 18:  9 2014-01-01    3 2013
# 19:  9 2014-04-01    3 2014
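A quick way to see what the `indx` expression does is to evaluate it outside data.table for one group size. In this sketch `n` is a hypothetical stand-in for `.N` and is not part of the answer's code:

```r
# n plays the role of .N for an id with four rows (such as id 9)
n <- 4L

# Rows are paired off in order: rows 1 and 2 share one index value,
# rows 3 and 4 share the next one
indx <- rep(seq_len(n / 2), each = 2) + 1L
print(indx)
```

Each pair then gets the same treatment as a two-row id when grouping by both `Id` and `indx`. Note that `n / 2` is only a whole number when the group has an even number of rows.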

Or another possible solution, provided by @akrun

setDT(df)[, `:=`(year = year(Date), indx = .N, indx2 = as.numeric(gl(.N,2, .N))), Id]
df[indx > 1, year:=(year[2]-1):year[2], list(Id, indx2)][]
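The `gl()` call is what builds the pairing index `indx2` here. As a rough sketch, with `n` standing in for `.N` and `pair_index` being a hypothetical helper introduced only for this illustration:

```r
# gl(n, k, length) generates a factor with n levels, each repeated k times,
# cut to the requested length; as.numeric() keeps only the integer codes
pair_index <- function(n) as.numeric(gl(n, 2, n))

print(pair_index(1L))  # single-row id: one index value
print(pair_index(2L))  # two rows: both rows share one index value
print(pair_index(4L))  # four rows: two pairs with different index values
```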

Answer 1: (score: 3)

Using dplyr, with an approach similar to @David Arenburg's:

library(dplyr)
df %>%
    group_by(Id) %>%
    mutate(year=as.numeric(sub('-.*', '', Date)),
           year=replace(year, n()>1, c(year[2]-1, year[2])))
#   Id    Date year
#1   1 2013-04 2012
#2   1 2013-12 2013
#3   2 2013-01 2012
#4   2 2013-12 2013
#5   3 2013-11 2012
#6   3 2013-12 2013
#7   4 2012-04 2012
#8   4 2013-12 2013
#9   5 2012-08 2013
#10  5 2014-12 2014
#11  6 2013-08 2013
#12  6 2014-12 2014
#13  7 2013-08 2013
#14  7 2014-12 2014
#15  8 2011-01 2011

Or using base R

Update

You could try

with(df, ave(as.numeric(sub('-.*', '', Date)), Id, 
     FUN=function(x) if(length(x)>1)(x[2]-1):x[2] else x))

#[1] 2012 2013 2012 2013 2012 2013 2012 2013 2013 2014 2013 2014 2013 2014 2011

Or

df$indx <- with(df, ave(Id, Id, FUN=function(x) (seq_along(x)-1)%/%2+1))

with(df, ave(as.numeric(sub('-.*', '', Date)), Id, indx, 
         FUN=function(x) if(length(x)>1)(x[2]-1):x[2] else x)) 
#[1] 2012 2013 2012 2013 2012 2013 2012 2013 2013 2014 2013 2014 2013 2014 2011
#[16] 2012 2013 2013 2014

Answer 2: (score: 2)

Here is a dplyr solution. You can remove the intermediate fields last_year and year2, but I have left them in for clarity:

library(stringr)
library(dplyr)

df %>%
  group_by(Id) %>%
  mutate(
    last_year = last(as.integer(str_sub(Date, 1, 4))),
    year2 = row_number() - n(),
    year = last_year + year2
    )
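The arithmetic here counts backwards from the last year of each group, one year per row. Below is a base R sketch of the same calculation for a single group; the vector `years` is typed in by hand from the question's data and is only an illustration:

```r
# Years of the four rows of id 9: 2013-11, 2013-12, 2014-01, 2014-04
years <- c(2013L, 2013L, 2014L, 2014L)

last_year <- years[length(years)]          # mirrors last(...)
year2 <- seq_along(years) - length(years)  # mirrors row_number() - n()
result <- last_year + year2
print(result)
```

For a two-row id this gives (last year minus one, last year). For the four-row case it yields four consecutive years ending in 2014, which differs from the 2012, 2013, 2013, 2014 the question asks for, so this approach seems best suited to ids with at most two rows.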